DexSim2Real2: Building Explicit World Model for Precise Articulated Object Manipulations

Department of Mechanical Engineering, Tsinghua University
JD Explore Academy

We present DexSim2Real2, a novel robot learning framework for precise, goal-conditioned articulated object manipulation with two-finger grippers and dexterous hands. We first build an explicit world model of the target object in a physics simulator through active interaction, and then use sampling-based MPC to search for a long-horizon manipulation trajectory that achieves the desired manipulation goal. Quantitative evaluation on real-world object manipulation verifies the effectiveness of the proposed framework for both types of end effectors.

Abstract

Articulated object manipulation is ubiquitous in daily life. In this paper, we present DexSim2Real2, a novel robot learning framework for goal-conditioned articulated object manipulation using both two-finger grippers and multi-finger dexterous hands.

The key to our framework is the construction of an explicit world model of unseen articulated objects through active interaction. This explicit world model enables sampling-based model predictive control (MPC) to plan trajectories that achieve different manipulation goals, without requiring human demonstrations or reinforcement learning.

The framework first predicts an interaction motion using an affordance estimation network trained on self-supervised interaction data or on internet videos of human manipulation. After the interactions are executed on the real robot to move the object's parts, a novel modeling pipeline based on 3D AIGC builds a digital twin of the object in simulation from the multiple frames of observations.

For dexterous multi-finger manipulation, we propose to utilize eigengrasps to reduce the high-dimensional action space, enabling more efficient trajectory search. Extensive experiments validate the framework's effectiveness for precise articulated object manipulation in both simulation and the real world, using a two-finger gripper and a 16-DoF dexterous hand.
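As a rough illustration of the eigengrasp idea (a PCA-style sketch of ours, not the paper's exact implementation), the snippet below extracts a low-dimensional grasp subspace from a set of recorded 16-DoF hand configurations, so that the trajectory search only has to sample a few eigengrasp coefficients per step. The data file and dimensions are hypothetical.

```python
import numpy as np

# Hypothetical dataset: N recorded 16-DoF hand joint configurations (radians).
hand_poses = np.load("hand_poses.npy")          # shape (N, 16), assumed file
mean_pose = hand_poses.mean(axis=0)

# PCA via SVD of the centered data: rows of Vt are the eigengrasp directions.
_, _, Vt = np.linalg.svd(hand_poses - mean_pose, full_matrices=False)
k = 3                                           # number of eigengrasps to keep
eigengrasps = Vt[:k]                            # shape (k, 16)

def to_joint_angles(alpha):
    """Map k eigengrasp coefficients back to a full 16-DoF joint command."""
    return mean_pose + alpha @ eigengrasps

# The trajectory search now only samples k coefficients instead of 16 joint angles.
sampled_alpha = np.random.uniform(-1.0, 1.0, size=k)
joint_command = to_joint_angles(sampled_alpha)
```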

The robust generalizability of the explicit world model also enables advanced manipulation strategies, such as manipulating objects with different tools.


Video

Method


Our framework consists of three phases. (1) In the Interactive Perception phase, given a partial point cloud of an unseen articulated object, we train an affordance prediction module and use it to change the object's joint state through a one-step interaction; training data can be acquired through self-supervised interaction in simulation or from egocentric human demonstration videos. (2) In the Explicit Physics Model Construction phase, we build an explicit model of the object in a physics simulator from the K+1 frames of observations collected before and after the interactions. (3) In the Sampling-based Model Predictive Control phase, we use this model to plan a long-horizon trajectory in simulation and then execute the trajectory on the real robot to complete the task. For dexterous hands, an eigengrasp module is additionally used for dimensionality reduction.
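As a concrete but simplified illustration of the third phase, the sketch below runs a cross-entropy-style sampling MPC search inside the reconstructed simulation. The `sim` interface (`reset`, `step`, `joint_state`) and the scalar joint-state cost are placeholder assumptions of ours, not the paper's actual API.

```python
import numpy as np

def sampling_mpc(sim, init_state, goal_joint_state,
                 horizon=20, n_samples=256, n_iters=5, action_dim=7):
    """Cross-entropy-style search for an action sequence in the explicit world model."""
    mean = np.zeros((horizon, action_dim))
    std = 0.5 * np.ones((horizon, action_dim))

    for _ in range(n_iters):
        seqs = mean + std * np.random.randn(n_samples, horizon, action_dim)
        costs = np.empty(n_samples)
        for i, seq in enumerate(seqs):
            sim.reset(init_state)              # roll out in the digital twin
            for action in seq:
                sim.step(action)
            # Cost: distance between achieved and desired articulated joint state.
            costs[i] = abs(sim.joint_state() - goal_joint_state)
        elite = seqs[np.argsort(costs)[:n_samples // 10]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3

    return mean        # best action sequence, to be executed on the real robot
```

Because the search runs entirely inside the simulator, the same routine can be reused unchanged for different manipulation goals on the same reconstructed object.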

Model Reconstruction


For each state of the articulated object, we begin by generating an unaligned and unscaled mesh from multi-view RGB images using 3D AIGC. Next, we estimate the mesh's scale and pose through differentiable rendering and segment the aligned mesh into sub-parts. Once segmented point clouds are obtained for each state, we infer the movable-part segmentation by analyzing the differences between frames. We then estimate the kinematic structure of the mesh, including the part tree hierarchy, joint categories (prismatic or revolute), and joint configurations (axis direction and origin). Finally, we construct a digital twin of the articulated object represented in URDF format, which can be easily loaded into different physics simulators.
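To make the joint-estimation step concrete, here is a minimal sketch that classifies a joint as prismatic or revolute and recovers its axis and origin from the movable part observed in two states. It assumes the part's points are in correspondence across states (e.g., vertices of the same aligned part mesh posed in each state); the function name and threshold are ours, not the paper's exact procedure.

```python
import numpy as np

def estimate_joint(part_pts_0, part_pts_1, angle_thresh_deg=5.0):
    """Classify the joint between two observed states of the movable part
    (corresponding (N, 3) point sets) and recover its axis and origin."""
    c0, c1 = part_pts_0.mean(axis=0), part_pts_1.mean(axis=0)

    # Kabsch: best rigid rotation aligning state 0 to state 1.
    H = (part_pts_0 - c0).T @ (part_pts_1 - c1)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))

    if np.degrees(angle) < angle_thresh_deg:
        # Negligible rotation: treat as prismatic, axis = translation direction.
        t = c1 - c0
        return "prismatic", t / np.linalg.norm(t), None

    # Revolute: the axis is the rotation's fixed direction (eigenvalue 1).
    w, v = np.linalg.eig(R)
    axis = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    # A point on the axis satisfies (I - R) p = t, with t = c1 - R c0.
    t = c1 - R @ c0
    origin = np.linalg.lstsq(np.eye(3) - R, t, rcond=None)[0]
    return "revolute", axis / np.linalg.norm(axis), origin
```

The recovered joint category, axis direction, and origin correspond directly to the fields of a joint element in the resulting URDF digital twin.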

Main Results