μ_0: A Scalable 3D Interaction-Trace World Model
Abstract
A scalable world model called μ₀ uses 3D traces to predict smooth trajectories for key interaction points, enabling embodiment-agnostic robot learning without action labels.
World models that capture how actions induce physical change enable scalable robot learning without reliance on embodiment-specific action labels. Pixel-space video models provide broad visual priors but expend model capacity on dense appearance reconstruction, while direct action models require embodiment-specific labels that hinder scalability. We present μ_0, a scalable world model based on 3D traces. Rather than predicting dense pixels or directly modeling actions, μ_0 forecasts smooth 3D trajectories for salient interaction points such as objects, tools, hands, and contact regions, yielding a compact, embodiment-agnostic motion interface. To enable training from diverse video sources, our TraceExtract system automatically extracts 3D supervision by selecting keypoints, constructing globally aligned traces, and associating motion segments with hierarchical language captions. This TraceExtract supervision pretrains μ_0 by combining a pretrained vision-language backbone with a modular trace expert, which represents each query via B-spline control points and predicts future traces. Experiments show that μ_0 outperforms baselines in both 2D and 3D trace prediction, including trace prediction models and tokenized VLM methods. Because μ_0 is frozen and reusable, it can be paired with action experts for downstream robot embodiments. Despite action-free pretraining, the resulting trace-conditioned policies achieve performance competitive with VLA models pretrained with action supervision, such as π_0. These results establish 3D traces as a scalable and transferable representation for cross-embodiment manipulation.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation (2026)
- PointAction: 3D Points as Universal Action Representations for Robot Control (2026)
- GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation (2026)
- WAM4D: Fast 4D World Action Model via Spatial Register Tokens (2026)
- Point Tracking Improves World Action Models (2026)
- ELAN4D: Embodiment-Centric 4D Supervision for Vision-Language-Action Models via Plug-and-Play Adaptation (2026)
- HARP-VLA: Human-Robot Aligned Representation Learning for Vision-Language-Action Model (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:
hf papers read 2606.13769 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper