DeMaVLA is a Vision-Language-Action (VLA) foundation model for generalizable deformable-object manipulation. It targets real-world bimanual household folding, where a robot must fold garments starting from random initial states, across different categories, geometries, materials, and scenes.
The model combines a Qwen3-VL backbone, a layer-aligned pruned action expert, flow-matching action generation, training-time real-time chunking (RTC), and human-in-the-loop DAgger. DeMaVLA is first pre-trained on about 5,000 hours of selected real-world dual-arm demonstrations, then post-trained on mixed folding demonstrations and corrective trajectories collected from real-robot failures.
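To make the flow-matching action generation step concrete, here is a minimal sketch of how an action chunk can be sampled by integrating a velocity field from noise to actions with Euler steps. It is not DeMaVLA's implementation: the function names, the time convention (t=0 is noise, t=1 is the action chunk), the chunk size (8 steps of 14 values, assuming two 7-DoF arms), and the closed-form toy velocity field that stands in for the learned action expert are all assumptions made for illustration.

```python
import random


def sample_action_chunk(velocity_fn, horizon, action_dim, num_steps=10, seed=0):
    """Euler-integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 (actions).

    velocity_fn is a hypothetical stand-in for the action expert; in a real VLA
    it would also be conditioned on images, language, and robot state.
    """
    rng = random.Random(seed)
    # Start from Gaussian noise with shape (horizon, action_dim).
    x = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(horizon)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        # One Euler step along the predicted velocity.
        x = [[xi + dt * vi for xi, vi in zip(x_row, v_row)]
             for x_row, v_row in zip(x, v)]
    return x


# Toy setup (assumption): the "data" is one fixed action chunk, so the ideal
# velocity for the linear path x_t = (1 - t) * noise + t * target is
# (target - x) / (1 - t).
HORIZON, ACTION_DIM = 8, 14
target = [[0.01 * (i * ACTION_DIM + j) for j in range(ACTION_DIM)]
          for i in range(HORIZON)]


def toy_velocity(x, t):
    return [[(a - xi) / (1.0 - t) for a, xi in zip(t_row, x_row)]
            for t_row, x_row in zip(target, x)]


chunk = sample_action_chunk(toy_velocity, HORIZON, ACTION_DIM, num_steps=10)
max_error = max(abs(c - a)
                for c_row, a_row in zip(chunk, target)
                for c, a in zip(c_row, a_row))
```

With this toy field the integration ends on the target chunk, so `max_error` is at floating-point precision. With a learned velocity network, the number of Euler steps trades inference latency against sample quality.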