DexSim2Real2: Building Explicit World Model for Precise Articulated Object Manipulations

Department of Mechanical Engineering, Tsinghua University
JD Explore Academy

We present DexSim2Real2, a novel robot learning framework for precise, goal-conditioned articulated object manipulation with both two-finger grippers and dexterous hands. We first build an explicit world model of the target object in a physics simulator through active interaction, and then use sampling-based model predictive control (MPC) to search for a long-horizon manipulation trajectory that achieves the desired manipulation goal. Quantitative evaluation on real object manipulation verifies the effectiveness of the proposed framework for both kinds of end effectors.

Abstract

Articulated object manipulation is ubiquitous in daily life. In this paper, we present DexSim2Real2, a novel robot learning framework for goal-conditioned articulated object manipulation using both two-finger grippers and multi-finger dexterous hands.

The key to our framework is constructing an explicit world model of unseen articulated objects through active one-step interactions. This explicit world model enables sampling-based model predictive control (MPC) to plan trajectories that achieve different manipulation goals, without human demonstrations or reinforcement learning.
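To make the planning step concrete, below is a minimal sketch of sampling-based MPC over the explicit world model, in the spirit of the cross-entropy method. The `sim.rollout` interface, the scalar joint-state outcome, and all hyperparameters are illustrative assumptions rather than the paper's actual implementation.

```python
import numpy as np

def plan_with_sampling_mpc(sim, init_state, goal_joint_state, horizon=20,
                           n_samples=256, n_elites=16, n_iters=5, action_dim=7):
    """Cross-entropy-style trajectory search in the simulated world model.

    `sim.rollout(state, actions)` is a hypothetical interface: it resets
    the digital twin to `state`, applies the action sequence, and
    returns the resulting object joint state (a scalar here).
    """
    mean = np.zeros((horizon, action_dim))
    std = 0.1 * np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences around the current mean.
        candidates = mean + std * np.random.randn(n_samples, horizon, action_dim)
        # Cost: distance of the simulated outcome to the goal joint state.
        costs = np.array([abs(sim.rollout(init_state, a) - goal_joint_state)
                          for a in candidates])
        # Refit the sampling distribution to the lowest-cost candidates.
        elites = candidates[np.argsort(costs)[:n_elites]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean  # planned trajectory, then executed on the real robot
```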

The framework first predicts an interaction motion using an affordance estimation network trained on self-supervised interaction data or on videos of human manipulation from the internet. After executing this interaction on the real robot, the framework constructs a digital twin of the articulated object in simulation from the two point clouds captured before and after the interaction. For dexterous multi-finger manipulation, we propose to use eigengrasps to reduce the high-dimensional action space, enabling more efficient trajectory search. Extensive experiments validate the framework's effectiveness for precise articulated object manipulation in both simulation and the real world, using a two-finger gripper and a 16-DoF dexterous hand.
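The eigengrasp idea can be illustrated with a simple PCA over recorded hand joint configurations. The class below is a hypothetical sketch (the `EigenGrasp` name, the 16-DoF dimensionality, and the component count are our assumptions), not the paper's implementation.

```python
import numpy as np

class EigenGrasp:
    """PCA-based eigengrasp subspace for a 16-DoF hand (illustrative)."""

    def __init__(self, n_components=3):
        self.n_components = n_components

    def fit(self, joint_configs):
        """joint_configs: (N, 16) array of recorded hand joint angles."""
        self.mean_ = joint_configs.mean(axis=0)
        # Principal axes from the SVD of the centered data matrix.
        _, _, vt = np.linalg.svd(joint_configs - self.mean_,
                                 full_matrices=False)
        self.components_ = vt[:self.n_components]  # (k, 16)
        return self

    def decode(self, coeffs):
        """Map k eigengrasp coefficients back to 16 joint angles."""
        return self.mean_ + coeffs @ self.components_
```

During trajectory search, actions can then be sampled in the k-dimensional coefficient space and decoded to full joint angles, shrinking the hand's action space from 16 dimensions to k.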

The robust generalizability of the explicit world model also enables advanced manipulation strategies, such as manipulation with different tools.

[Abstract figure]

Video

Method

[Method overview figure]

Our framework consists of three phases. (1) Given a partial point cloud of an unseen articulated object, in the Interactive Perception phase, we train an affordance prediction module and use it to change the object's joint state through a one-step interaction. Training data can be acquired through self-supervised interaction in simulation or from egocentric human demonstration videos. (2) In the Explicit Physics Model Construction phase, we build an explicit physics model of the object in a physics simulator from the two point clouds captured before and after the interaction. (3) In the Sampling-based Model Predictive Control phase, we use this model to plan a long-horizon trajectory in simulation and then execute the trajectory on the real robot to complete the task. For dexterous hands, an additional eigengrasp module reduces the dimensionality of the action space.
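To make phase (2) concrete, the sketch below estimates the parameters of a revolute joint from corresponding points on the moving part before and after the one-step interaction. This is a simplified illustration under strong assumptions (known point correspondences and a clean segmentation of the moving link); all names are ours, not the paper's implementation.

```python
import numpy as np

def estimate_revolute_joint(pts_before, pts_after):
    """Fit a revolute joint to the moving link's motion (illustrative).

    pts_before, pts_after: (N, 3) corresponding points on the moving
    link, captured before and after the interaction. Returns the joint
    axis direction, a point on the axis, and the joint angle change.
    """
    # Rigid transform q = R p + t via the Kabsch algorithm.
    p_mean, q_mean = pts_before.mean(axis=0), pts_after.mean(axis=0)
    H = (pts_before - p_mean).T @ (pts_after - q_mean)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean

    # Joint angle from the rotation-matrix trace.
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    # Axis direction: eigenvector of R with eigenvalue 1.
    w, v = np.linalg.eig(R)
    axis = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    # A point on the axis: a fixed point of the transform, (I - R) c = t.
    point = np.linalg.lstsq(np.eye(3) - R, t, rcond=None)[0]
    return axis, point, angle
```

The estimated axis and angle then suffice to instantiate a digital twin of the object (e.g., as a model with one revolute joint) inside the simulator.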

Main Results


For complex manipulation tasks, the dexterous hand can complete tasks in fewer steps than the two-finger gripper.
[Results figure]
We validate the effectiveness of the EigenGrasp method in reducing the action dimension of the dexterous hand by evaluating three metrics: success rate, joint jerk, and algorithm running time (a sketch of the jerk metric follows below).
For scenarios where the object is beyond the robot's reach or the gripper cannot fit into the object, we use a T-shaped tool or a semi-ring tool to interact with the object.
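For reference, joint jerk can be computed as the third finite difference of the joint-position trajectory. The snippet below is a minimal sketch of such a smoothness metric under our own reading; the function name and the exact aggregation are assumptions, not necessarily the paper's formula.

```python
import numpy as np

def mean_squared_joint_jerk(traj, dt):
    """Smoothness metric for a joint trajectory (illustrative).

    traj: (T, n_joints) joint positions sampled at a fixed timestep dt.
    Jerk is approximated by the third finite difference of position;
    lower values indicate smoother motion.
    """
    jerk = np.diff(traj, n=3, axis=0) / dt**3  # (T - 3, n_joints)
    return float(np.mean(np.sum(jerk**2, axis=1)))
```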