DeMaVLA is a Vision-Language-Action (VLA) foundation model for generalizable deformable manipulation. It targets real-world bimanual household folding, where robots must handle garments from random initial states across different categories, geometries, materials, and scenes.

The model combines a Qwen3-VL backbone, a layer-aligned pruned action expert, flow-matching action generation, training-time real-time chunking (RTC), and human-in-the-loop DAgger. DeMaVLA is first pre-trained on about 5,000 hours of selected real-world dual-arm demonstrations, then post-trained on mixed folding demonstrations and corrective trajectories collected from real-robot failures.
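To make the flow-matching step more concrete, the sketch below shows how an action chunk can be sampled by integrating a velocity field from Gaussian noise to actions with fixed-step Euler updates. It is a minimal illustration, not DeMaVLA's released code: the function names, the 16-step horizon, the 14-dimensional action (assumed here as two 7-DoF arms), the time convention (t = 0 is noise, t = 1 is the action), and the closed-form toy velocity field standing in for the learned action expert are all assumptions.

```python
import numpy as np

def sample_action_chunk(velocity_fn, horizon, action_dim, num_steps=10, rng=None):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (actions) with Euler steps."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal((horizon, action_dim))  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)  # one Euler step along the flow
    return x

def make_toy_velocity(target):
    """Hypothetical stand-in for the action expert.

    For the straight-line path x_t = (1 - t) * noise + t * target, the
    velocity that carries x_t to the target is (target - x_t) / (1 - t).
    A real model would predict this from images, language, and robot state.
    """
    def velocity_fn(x, t):
        return (target - x) / (1.0 - t)
    return velocity_fn

# Assumed shapes: 16 future steps, 14 action dimensions.
target = np.zeros((16, 14))
target[:, 0] = 0.5
chunk = sample_action_chunk(
    make_toy_velocity(target),
    horizon=16,
    action_dim=14,
    rng=np.random.default_rng(0),
)
```

With this toy field the final Euler step lands on the target chunk, which makes the sampler easy to check; a learned velocity network would replace `make_toy_velocity` and take the observation as additional input.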
