Cosmos-Predict2.5-14B GR1 Action-Conditioned World Model
Action-conditioned video prediction model fine-tuned from nvidia/Cosmos-Predict2.5-14B on GR1 dual-arm robot teleoperation data.
Given an initial observation frame and a sequence of robot actions, this model predicts future video frames that depict the robot executing those actions.
Model Details
- Base model: Cosmos-Predict2.5-14B (video2world pre-trained)
- Fine-tuning data: GR1 dual-arm robot teleoperation (LeRobot format)
- Action dimension: 29 (left_arm:7 + left_hand:6 + right_arm:7 + right_hand:6 + waist:3)
- Video frames: 13 frames per chunk (1 conditional + 12 predicted)
- Temporal compression: 4x (13 image frames → 4 latent frames)
- Resolution: 480x832
- Network: cosmos_v1_14B_action_chunk_conditioned (36 blocks, 5120 channels, 40 heads)
- Checkpoint format: EMA weights in bf16 (model_ema_bf16.pt, ~28GB)
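As a sanity check, the per-frame action layout and the latent-frame arithmetic above can be sketched in Python. The segment names and widths come from the list above; the helper function itself is illustrative and not part of the Cosmos codebase:

```python
import numpy as np

# Per-frame action layout from the model card: 29 dims total
SEGMENTS = {
    "left_arm": 7,
    "left_hand": 6,
    "right_arm": 7,
    "right_hand": 6,
    "waist": 3,
}
ACTION_DIM = sum(SEGMENTS.values())  # 29

def split_action(frame_action: np.ndarray) -> dict:
    """Split a single 29-dim action vector into named segments."""
    assert frame_action.shape == (ACTION_DIM,)
    out, i = {}, 0
    for name, width in SEGMENTS.items():
        out[name] = frame_action[i:i + width]
        i += width
    return out

# Latent-frame arithmetic: the 4x temporal VAE keeps the first frame
# and compresses the remaining 12 frames into 3 latent frames.
num_frames = 13
latent_frames = 1 + (num_frames - 1) // 4  # -> 4
```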
Training Configuration
| Parameter | Value |
|---|---|
| GPUs | 8x H200 (140GB) |
| Batch size | 2 per GPU (global=16) |
| Learning rate | 4e-5 |
| Max iterations | 4,000+ |
| Optimizer | AdamW (cosine schedule) |
| Precision | bf16 |
| Episodes per task | 3 (uniform sampling) |
| FSDP shard size | 32 |
| VAE | Wan2.1 |
| Text encoder | Cosmos-Reason1-7B |
| Tokenizer | Qwen2.5-VL-7B-Instruct |
Inference
Prerequisites
This model requires the Cosmos-Predict2.5 codebase and the following dependencies:
- Wan2.1 VAE: Wan2.1_VAE.pth (the 14B model includes its own copy)
- Text encoder: Cosmos-Reason1-7B
- Tokenizer: Qwen2.5-VL-7B-Instruct
Quick Start
```bash
python cosmos_predict2/_src/predict2/action/inference/inference_gr00t.py \
  --experiment=cosmos_predict2p5_2B_action_conditioned_gr00t_gr1_customized_13frame_full_16nodes_release_oss \
  --ckpt_path=/path/to/model_ema_bf16.pt \
  --input_video_root=/path/to/eval_data \
  --save_root=/path/to/output \
  --resolution 480,832 \
  --guidance 0 \
  --chunk_size 12 \
  --fps_downsample_ratio 2 \
  --save_fps 10 \
  --vae_path /path/to/Wan2.1_VAE.pth \
  --text_encoder_path /path/to/Cosmos-Reason1-7B \
  --qwen_path /path/to/Qwen2.5-VL-7B-Instruct \
  --experiment_opts net=cosmos_v1_14B_action_chunk_conditioned model.config.fsdp_shard_size=32
```
Note: --experiment passes the registered 2B experiment name for the Hydra config lookup; --experiment_opts then overrides the network architecture to the 14B variant.
Input Data Format
The inference script expects a directory containing paired files:
- episode_XXXXXX.mp4 - input video (the first frame is used as the conditional frame)
- episode_XXXXXX_actions.npy - action array of shape (T, 29) in NumPy format
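A minimal sketch of validating such a directory against the pairing convention above. The checker itself is illustrative (not part of the repository); it only assumes the file naming and the (T, 29) action shape stated here:

```python
from pathlib import Path
import numpy as np

ACTION_DIM = 29  # per the model card

def validate_eval_dir(root: str) -> list:
    """Return episode IDs that have a matching .mp4/.npy pair
    with a well-formed (T, 29) action array."""
    root = Path(root)
    valid = []
    for npy in sorted(root.glob("episode_*_actions.npy")):
        ep = npy.stem.removesuffix("_actions")  # e.g. "episode_000000"
        if not (root / f"{ep}.mp4").exists():
            continue  # video missing: skip this episode
        actions = np.load(npy)
        if actions.ndim == 2 and actions.shape[1] == ACTION_DIM:
            valid.append(ep)
    return valid
```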
Eval Data Preparation
Use scripts/prepare_gr1_eval_data.py to extract eval episodes from a LeRobot dataset:
```bash
python scripts/prepare_gr1_eval_data.py \
  --dataset-path /path/to/GR1_robot \
  --output-dir /path/to/eval_data \
  --num-episodes 100
```
One-command Eval
```bash
bash scripts/eval_gr1_robot.sh /path/to/model_ema_bf16.pt --debug
```
The eval script auto-detects 14B vs 2B from the checkpoint path and applies the correct network config.
Examples
| Episode | Prediction | Ground Truth |
|---|---|---|
| 000000 | episode_000000.mp4 | episode_000000.mp4 |
| 000008 | episode_000008.mp4 | episode_000008.mp4 |
| 000056 | episode_000056.mp4 | episode_000056.mp4 |
| 000084 | episode_000084.mp4 | episode_000084.mp4 |
| 000087 | episode_000087.mp4 | episode_000087.mp4 |
File Structure
```
.
├── model_ema_bf16.pt              # Model weights (EMA, bf16, ~28GB)
├── inference/
│   ├── inference_gr00t.py         # Main inference script
│   └── inference_pipeline.py      # Inference pipeline (ActionVideo2WorldInference)
├── scripts/
│   ├── eval_gr1_robot.sh          # One-command eval script (auto 2B/14B)
│   ├── prepare_gr1_eval_data.py   # Eval data preparation
│   └── convert_distcp_to_pt.py    # DCP -> PT checkpoint converter
└── examples/
    ├── predicted/                 # Predicted videos
    └── gt/                        # Ground truth videos
```
Limitations
- Trained only on GR1 dual-arm robot data; may not generalize to other embodiments
- Prediction quality degrades for long-horizon generation (many chunks)
- Episodes with fewer than 12 actions cannot be processed (chunk_size=12)
- 14B model requires ~60GB GPU memory for inference (single GPU)
- The model works best without classifier-free guidance (guidance=0)
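The chunk_size=12 constraint above can be made concrete: a T-step action sequence yields floor(T / 12) full chunks, so episodes shorter than 12 steps yield none. A sketch, with an illustrative helper that is not part of the repository:

```python
import numpy as np

CHUNK_SIZE = 12  # predicted frames per chunk, per the model card

def split_into_chunks(actions: np.ndarray) -> list:
    """Split a (T, 29) action sequence into full 12-step chunks.
    A trailing remainder shorter than 12 steps is dropped."""
    n_full = actions.shape[0] // CHUNK_SIZE
    return [actions[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE]
            for i in range(n_full)]

# An episode with fewer than 12 actions yields zero chunks,
# so it cannot be processed.
assert split_into_chunks(np.zeros((7, 29))) == []
```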
License
Apache-2.0 (same as base Cosmos-Predict2.5)
Acknowledgements
- Base model: NVIDIA Cosmos-Predict2.5
- Training data: GR1 teleoperation dataset (LeRobot format)
- VAE: Wan2.1