DexSim2Real2: Building Explicit World Model for Precise Articulated Object Manipulations

Department of Mechanical Engineering, Tsinghua University
JD Explore Academy

We present DexSim2Real2, a novel robot learning framework for precise, goal-conditioned articulated object manipulation with two-finger grippers and dexterous hands. We first build an explicit world model of the target object in a physics simulator through active interaction, and then use sampling-based MPC to search for a long-horizon manipulation trajectory that achieves the desired manipulation goal. Quantitative evaluation on real-world object manipulation verifies the effectiveness of the proposed framework for both types of end effectors.

Abstract

Articulated object manipulation is ubiquitous in daily life. In this paper, we present DexSim2Real2, a novel robot learning framework for goal-conditioned articulated object manipulation using both two-finger grippers and multi-finger dexterous hands.

The key to our framework is the construction of an explicit world model of unseen articulated objects through active interaction. This explicit world model enables sampling-based model predictive control (MPC) to plan trajectories that achieve different manipulation goals, without requiring human demonstrations or reinforcement learning.

The framework first predicts an interaction motion using an affordance estimation network trained on self-supervised interaction data or on internet videos of human manipulation. After the interactions are executed on the real robot to move the object's parts, a novel modeling pipeline based on 3D AIGC builds a digital twin of the object in simulation from the multiple frames of observations.

For dexterous multi-finger manipulation, we propose to utilize eigengrasps to reduce the high-dimensional action space, enabling more efficient trajectory search. Extensive experiments validate the framework's effectiveness for precise articulated object manipulation in both simulation and the real world, using a two-finger gripper and a 16-DoF dexterous hand.
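As a rough illustration of the eigengrasp idea (a PCA-style sketch of ours, not the paper's exact implementation), the snippet below extracts a low-dimensional grasp subspace from a set of recorded 16-DoF hand configurations, so that the trajectory search only has to sample a few eigengrasp coefficients per step. The data file and dimensions are hypothetical.

```python
import numpy as np

# Hypothetical dataset: N recorded 16-DoF hand joint configurations (radians).
hand_poses = np.load("hand_poses.npy")          # shape (N, 16), assumed file
mean_pose = hand_poses.mean(axis=0)

# PCA via SVD of the centered data: rows of Vt are the eigengrasp directions.
_, _, Vt = np.linalg.svd(hand_poses - mean_pose, full_matrices=False)
k = 3                                           # number of eigengrasps to keep
eigengrasps = Vt[:k]                            # shape (k, 16)

def to_joint_angles(alpha):
    """Map k eigengrasp coefficients back to a full 16-DoF joint command."""
    return mean_pose + alpha @ eigengrasps

# The trajectory search now only samples k coefficients instead of 16 joint angles.
sampled_alpha = np.random.uniform(-1.0, 1.0, size=k)
joint_command = to_joint_angles(sampled_alpha)
```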

The robust generalizability of the explicit world model also enables advanced manipulation strategies, such as manipulating objects with different tools.


Video

Method


Our framework consists of three phases. (1) In the Interactive Perception phase, given a partial point cloud of an unseen articulated object, we train an affordance prediction module and use it to change the object's joint state through a one-step interaction; training data can be acquired through self-supervised interaction in simulation or from egocentric human demonstration videos. (2) In the Explicit Physics Model Construction phase, we build an explicit model of the object in a physics simulator from the K+1 frames of observations collected before and after the interactions. (3) In the Sampling-based Model Predictive Control phase, we use this model to plan a long-horizon trajectory in simulation and then execute the trajectory on the real robot to complete the task. For dexterous hands, an eigengrasp module is additionally used for dimensionality reduction.
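As a concrete but simplified illustration of the third phase, the sketch below runs a cross-entropy-style sampling MPC search inside the reconstructed simulation. The `sim` interface (`reset`, `step`, `joint_state`) and the scalar joint-state cost are placeholder assumptions of ours, not the paper's actual API.

```python
import numpy as np

def sampling_mpc(sim, init_state, goal_joint_state,
                 horizon=20, n_samples=256, n_iters=5, action_dim=7):
    """Cross-entropy-style search for an action sequence in the explicit world model."""
    mean = np.zeros((horizon, action_dim))
    std = 0.5 * np.ones((horizon, action_dim))

    for _ in range(n_iters):
        seqs = mean + std * np.random.randn(n_samples, horizon, action_dim)
        costs = np.empty(n_samples)
        for i, seq in enumerate(seqs):
            sim.reset(init_state)              # roll out in the digital twin
            for action in seq:
                sim.step(action)
            # Cost: distance between achieved and desired articulated joint state.
            costs[i] = abs(sim.joint_state() - goal_joint_state)
        elite = seqs[np.argsort(costs)[:n_samples // 10]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3

    return mean        # best action sequence, to be executed on the real robot
```

Because the search runs entirely inside the simulator, the same routine can be reused unchanged for different manipulation goals on the same reconstructed object.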

Model Reconstruction


For each state of the articulated object, we begin by generating an unaligned and unscaled mesh from multi-view RGB images using 3D AIGC. Next, we estimate the mesh's scale and pose through differentiable rendering and segment the aligned mesh into sub-parts. Once segmented point clouds are obtained for each state, we infer the movable-part segmentation by analyzing the differences between frames. We then estimate the kinematic structure of the mesh, including the part tree hierarchy, joint categories (prismatic or revolute), and joint configurations (axis direction and origin). Finally, we construct a digital twin of the articulated object represented in URDF format, which can be easily loaded into different physics simulators.
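To make the joint-estimation step concrete, here is a minimal sketch that classifies a joint as prismatic or revolute and recovers its axis and origin from the movable part observed in two states. It assumes the part's points are in correspondence across states (e.g., vertices of the same aligned part mesh posed in each state); the function name and threshold are ours, not the paper's exact procedure.

```python
import numpy as np

def estimate_joint(part_pts_0, part_pts_1, angle_thresh_deg=5.0):
    """Classify the joint between two observed states of the movable part
    (corresponding (N, 3) point sets) and recover its axis and origin."""
    c0, c1 = part_pts_0.mean(axis=0), part_pts_1.mean(axis=0)

    # Kabsch: best rigid rotation aligning state 0 to state 1.
    H = (part_pts_0 - c0).T @ (part_pts_1 - c1)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))

    if np.degrees(angle) < angle_thresh_deg:
        # Negligible rotation: treat as prismatic, axis = translation direction.
        t = c1 - c0
        return "prismatic", t / np.linalg.norm(t), None

    # Revolute: the axis is the rotation's fixed direction (eigenvalue 1).
    w, v = np.linalg.eig(R)
    axis = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    # A point on the axis satisfies (I - R) p = t, with t = c1 - R c0.
    t = c1 - R @ c0
    origin = np.linalg.lstsq(np.eye(3) - R, t, rcond=None)[0]
    return "revolute", axis / np.linalg.norm(axis), origin
```

The recovered joint category, axis direction, and origin correspond directly to the fields of a joint element in the resulting URDF digital twin.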

Main Results