We introduce DragAPart, a method that, given an image and a set of drags as input, generates a new image of the same object in a new state, compatible with the action of the drags. Unlike prior works that focus on repositioning whole objects, DragAPart predicts part-level interactions, such as opening and closing a drawer. We study this problem as a proxy for learning a generalist motion model, not restricted to a specific kinematic structure or object category. To this end, we start from a pre-trained image generator and fine-tune it on a new synthetic dataset, Drag-a-Move, which we introduce. Combined with a new encoding for the drags and dataset randomization, the new model generalizes well to real images and different categories. Compared to prior motion-controlled generators, we demonstrate much better part-level motion understanding.
Our model is capable of preserving fine-grained texture details, generating plausible shading, handling thin structures, composing the motion of multiple parts, "dreaming" up internal structures of the object, and generalizing to categories not seen during training.
Our model can be used to understand how the movable parts of an object move in response to a drag. Given a 3D mesh with parts pre-segmented and a 3D drag on the mesh, the type of motion (i.e., revolute or prismatic) and the corresponding parameters (e.g., the axis of rotation or translation) can be inferred by using the generated images as pseudo ground truth and optimizing the motion parameters against an image-space RGB loss. This also indicates that our model is relatively consistent across different viewpoints.
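The sketch below illustrates this idea under stated assumptions; it is not the released implementation. It assumes a hypothetical differentiable renderer `render_part(R, t)` (e.g., built on PyTorch3D) that renders the mesh with the segmented part transformed by rotation `R` and translation `t`, and a stack of DragAPart generations `pseudo_gt` used as pseudo ground truth. The motion axis, pivot, and per-frame magnitudes are optimized to minimize the RGB loss, and the motion type is chosen as the hypothesis with the lower final loss.

```python
# A minimal sketch (not the authors' released code): fit a part's motion
# parameters by matching differentiable renders against DragAPart outputs.
import torch

def rodrigues(axis: torch.Tensor, angle: torch.Tensor) -> torch.Tensor:
    """Rotation matrix about a (learnable) axis via Rodrigues' formula."""
    a = axis / (axis.norm() + 1e-8)
    K = torch.stack([
        torch.stack([axis.new_zeros(()), -a[2], a[1]]),
        torch.stack([a[2], axis.new_zeros(()), -a[0]]),
        torch.stack([-a[1], a[0], axis.new_zeros(())]),
    ])
    return torch.eye(3) + torch.sin(angle) * K + (1 - torch.cos(angle)) * (K @ K)

def fit_motion(render_part, pseudo_gt, motion_type="revolute", steps=300):
    """Fit axis/pivot (revolute) or axis/offsets (prismatic) to pseudo-GT frames."""
    n = pseudo_gt.shape[0]                      # number of generated frames
    axis = torch.randn(3, requires_grad=True)   # motion axis (direction)
    pivot = torch.zeros(3, requires_grad=True)  # point on the axis (revolute only)
    magn = torch.zeros(n, requires_grad=True)   # per-frame angle or displacement
    opt = torch.optim.Adam([axis, pivot, magn], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        loss = 0.0
        for i in range(n):
            if motion_type == "revolute":
                R = rodrigues(axis, magn[i])
                t = pivot - R @ pivot           # rotate about the pivot point
            else:                               # prismatic: slide along the axis
                R = torch.eye(3)
                t = magn[i] * axis / (axis.norm() + 1e-8)
            loss = loss + ((render_part(R, t) - pseudo_gt[i]) ** 2).mean()
        loss.backward()
        opt.step()
    return loss.item(), axis.detach(), pivot.detach(), magn.detach()

# Fit both hypotheses and keep the one with the lower final RGB loss.
```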
Examples (interactive on the project page): for each Input 3D Model, the Generated Pseudo Ground Truth and the inferred 3D motion, which can be explored with a slider.
The internal features of our DragAPart model contain richer part-level information than alternatives. We explore how to use them to segment the movable parts of an object in 2D, prompted by a few drags.
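One simple way to turn drag prompts and internal features into a 2D movable-part mask is sketched below. It assumes `features` are per-pixel activations extracted from an intermediate layer of the denoising U-Net (upsampled to image resolution); the similarity-thresholding scheme is an illustrative assumption, not the paper's exact segmentation method.

```python
# A minimal sketch: segment the movable part by comparing every pixel's
# internal feature to the features at the drag source locations.
import torch
import torch.nn.functional as F

def segment_movable_part(features: torch.Tensor,      # (C, H, W) internal features
                         drag_sources: torch.Tensor,  # (K, 2) drag start pixels (x, y)
                         threshold: float = 0.6) -> torch.Tensor:
    feats = F.normalize(features, dim=0)               # unit-norm feature per pixel
    # Average the features at the K drag source locations to form a part query.
    query = torch.stack([feats[:, y, x] for x, y in drag_sources.tolist()]).mean(0)
    query = F.normalize(query, dim=0)
    sim = torch.einsum("chw,c->hw", feats, query)      # cosine similarity map
    return (sim > threshold).float()                   # binary movable-part mask
```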
(a) Our model takes as input a single RGB image and one or more drags, and generates a second image that reflects the effect of the drags. (b) We propose a novel flow encoder, which enables us to inject the motion control into the latent diffusion model at different resolutions more efficiently. Instead of representing the sparse drags as a flow image and downsampling it with a convolutional network, we assign different channels to different drags and use separate channels to encode the drag source and the drag termination. (c) At inference time, our model generalizes to real data, synthesizing physically plausible part-level dynamics.
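The sketch below shows the flavor of this encoding: each drag owns its own channel slots, with the source and the termination written into separate channels, and the same encoding is produced at every U-Net resolution so no learned downsampler is needed. The exact channel contents here are an assumption for illustration, not the released implementation.

```python
# A minimal sketch of a multi-resolution, per-drag conditioning encoding.
import torch

def encode_drags(drags: torch.Tensor,                 # (K, 4): x_src, y_src, x_dst, y_dst in [0, 1]
                 resolutions=(64, 32, 16, 8),
                 max_drags: int = 5):
    """Return {resolution: (2 * max_drags, h, h)} drag-conditioning tensors."""
    encodings = {}
    for h in resolutions:
        enc = torch.zeros(2 * max_drags, h, h)
        for k, (xs, ys, xd, yd) in enumerate(drags.tolist()[:max_drags]):
            src = (int(ys * (h - 1)), int(xs * (h - 1)))
            dst = (int(yd * (h - 1)), int(xd * (h - 1)))
            enc[2 * k, src[0], src[1]] = 1.0          # channel for the drag source
            enc[2 * k + 1, dst[0], dst[1]] = 1.0      # channel for the drag termination
        encodings[h] = enc
    return encodings

# Example: one drag from the image centre toward the right edge, encoded at
# every resolution and ready to be injected alongside the U-Net feature maps.
maps = encode_drags(torch.tensor([[0.5, 0.5, 0.9, 0.5]]))
print({h: m.shape for h, m in maps.items()})
```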
@article{li2024dragapart,
  title   = {DragAPart: Learning a Part-Level Motion Prior for Articulated Objects},
  author  = {Li, Ruining and Zheng, Chuanxia and Rupprecht, Christian and Vedaldi, Andrea},
  journal = {arXiv preprint arXiv:2403.15382},
  year    = {2024}
}
We would like to thank Minghao Chen, Junyu Xie and Laurynas Karazija for insightful discussions. This work is in part supported by a Toshiba Research Studentship and ERC-CoG UNION 101001212.