Cosmos-Predict2.5-14B GR1 Action-Conditioned World Model

Action-conditioned video prediction model fine-tuned from nvidia/Cosmos-Predict2.5-14B on GR1 dual-arm robot teleoperation data.

Given an initial observation frame and a sequence of robot actions, this model predicts future video frames that depict the robot executing those actions.

Model Details

  • Base model: Cosmos-Predict2.5-14B (video2world pre-trained)
  • Fine-tuning data: GR1 dual-arm robot teleoperation (LeRobot format)
  • Action dimension: 29 (left_arm:7 + left_hand:6 + right_arm:7 + right_hand:6 + waist:3)
  • Video frames: 13 frames per chunk (1 conditional + 12 predicted)
  • Temporal compression: 4x (13 image frames compress to 4 latent frames)
  • Resolution: 480x832
  • Network: cosmos_v1_14B_action_chunk_conditioned (36 blocks, 5120 channels, 40 heads)
  • Checkpoint format: EMA weights in bf16 (model_ema_bf16.pt, ~28GB)
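The action and frame dimensions above can be sanity-checked in a few lines. This is an illustrative sketch; the constant and helper names are not part of the codebase:

```python
# Sanity checks for the dimensions listed above; names are illustrative.

# 29-D action vector = left_arm(7) + left_hand(6) + right_arm(7) + right_hand(6) + waist(3)
ACTION_PARTS = {"left_arm": 7, "left_hand": 6, "right_arm": 7, "right_hand": 6, "waist": 3}
ACTION_DIM = sum(ACTION_PARTS.values())

def num_latent_frames(pixel_frames: int, temporal_compression: int = 4) -> int:
    """The single conditional frame is kept as-is; the remaining frames compress 4x."""
    return 1 + (pixel_frames - 1) // temporal_compression

assert ACTION_DIM == 29          # matches the model's action dimension
assert num_latent_frames(13) == 4  # 13 image frames -> 4 latent frames
```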

Training Configuration

Parameter          Value
GPUs               8x H200 (140GB)
Batch size         2 per GPU (global = 16)
Learning rate      4e-5
Max iterations     4,000+
Optimizer          AdamW (cosine schedule)
Precision          bf16
Episodes per task  3 (uniform sampling)
FSDP shard size    32
VAE                Wan2.1
Text encoder       Cosmos-Reason1-7B
Tokenizer          Qwen2.5-VL-7B-Instruct

Inference

Prerequisites

This model requires the Cosmos-Predict2.5 codebase and the following dependencies:

  • Wan2.1 VAE: Wan2.1_VAE.pth (14B model includes its own copy)
  • Text encoder: Cosmos-Reason1-7B
  • Tokenizer: Qwen2.5-VL-7B-Instruct

Quick Start

python cosmos_predict2/_src/predict2/action/inference/inference_gr00t.py \
  --experiment=cosmos_predict2p5_2B_action_conditioned_gr00t_gr1_customized_13frame_full_16nodes_release_oss \
  --ckpt_path=/path/to/model_ema_bf16.pt \
  --input_video_root=/path/to/eval_data \
  --save_root=/path/to/output \
  --resolution 480,832 \
  --guidance 0 \
  --chunk_size 12 \
  --fps_downsample_ratio 2 \
  --save_fps 10 \
  --vae_path /path/to/Wan2.1_VAE.pth \
  --text_encoder_path /path/to/Cosmos-Reason1-7B \
  --qwen_path /path/to/Qwen2.5-VL-7B-Instruct \
  --experiment_opts net=cosmos_v1_14B_action_chunk_conditioned model.config.fsdp_shard_size=32

Note: --experiment passes the registered 2B experiment name for Hydra config lookup; --experiment_opts then overrides the network architecture to 14B.

Input Data Format

The inference script expects a directory containing paired files:

  • episode_XXXXXX.mp4 - Input video (first frame used as conditional)
  • episode_XXXXXX_actions.npy - Action array of shape (T, 29) in numpy format
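The pairing convention above can be checked before launching inference. A minimal sketch, assuming only the directory layout described here (the `validate_eval_dir` helper is not part of the codebase):

```python
# Illustrative: verify each episode_XXXXXX.mp4 has a matching
# episode_XXXXXX_actions.npy with a (T, 29) action array.
from pathlib import Path

import numpy as np


def validate_eval_dir(root: str, action_dim: int = 29) -> list[str]:
    """Return the episode stems found in root, raising on missing/malformed pairs."""
    episodes = []
    for video in sorted(Path(root).glob("episode_*.mp4")):
        actions_path = video.with_name(f"{video.stem}_actions.npy")
        if not actions_path.exists():
            raise FileNotFoundError(f"missing actions for {video.name}")
        actions = np.load(actions_path)
        if actions.ndim != 2 or actions.shape[1] != action_dim:
            raise ValueError(
                f"{actions_path.name}: expected shape (T, {action_dim}), got {actions.shape}"
            )
        episodes.append(video.stem)
    return episodes
```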

Eval Data Preparation

Use scripts/prepare_gr1_eval_data.py to extract eval episodes from a LeRobot dataset:

python scripts/prepare_gr1_eval_data.py \
  --dataset-path /path/to/GR1_robot \
  --output-dir /path/to/eval_data \
  --num-episodes 100

One-command Eval

bash scripts/eval_gr1_robot.sh /path/to/model_ema_bf16.pt --debug

The eval script auto-detects 14B vs 2B from the checkpoint path and applies the correct network config.
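A hedged sketch of how such path-based detection could look. This mirrors the behavior described above but is not the script's actual logic, and the 2B network name is an assumption by analogy with the 14B one:

```python
# Illustrative only: infer the network config from the checkpoint path.
# The 2B network name below is assumed, not confirmed by the model card.
def detect_network(ckpt_path: str) -> str:
    if "14b" in ckpt_path.lower():
        return "cosmos_v1_14B_action_chunk_conditioned"
    return "cosmos_v1_2B_action_chunk_conditioned"
```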

Examples

File Structure

.
β”œβ”€β”€ model_ema_bf16.pt           # Model weights (EMA, bf16, ~28GB)
β”œβ”€β”€ inference/
β”‚   β”œβ”€β”€ inference_gr00t.py      # Main inference script
β”‚   └── inference_pipeline.py   # Inference pipeline (ActionVideo2WorldInference)
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ eval_gr1_robot.sh       # One-command eval script (auto 2B/14B)
β”‚   β”œβ”€β”€ prepare_gr1_eval_data.py # Eval data preparation
β”‚   └── convert_distcp_to_pt.py # DCP -> PT checkpoint converter
└── examples/
    β”œβ”€β”€ predicted/              # Predicted videos
    └── gt/                     # Ground truth videos

Limitations

  • Trained only on GR1 dual-arm robot data; may not generalize to other embodiments
  • Prediction quality degrades for long-horizon generation (many chunks)
  • Episodes with fewer than 12 actions cannot be processed (chunk_size=12)
  • 14B model requires ~60GB GPU memory for inference (single GPU)
  • Works best without classifier-free guidance (guidance=0)
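The chunk-size constraint above follows from how actions are consumed 12 at a time. A sketch of the arithmetic, assuming trailing partial chunks are dropped (the helper name is illustrative):

```python
def num_chunks(num_actions: int, chunk_size: int = 12) -> int:
    """Number of 12-action prediction chunks; episodes shorter than one chunk fail."""
    if num_actions < chunk_size:
        raise ValueError(f"need >= {chunk_size} actions, got {num_actions}")
    return num_actions // chunk_size  # assumes any trailing remainder is dropped
```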

License

Apache-2.0 (same as base Cosmos-Predict2.5)
