Cosmos-Predict2.5-14B GR1 Action-Conditioned World Model

Action-conditioned video prediction model fine-tuned from nvidia/Cosmos-Predict2.5-14B on GR1 dual-arm robot teleoperation data.

Given an initial observation frame and a sequence of robot actions, this model predicts future video frames that depict the robot executing those actions.

Model Details

  • Base model: Cosmos-Predict2.5-14B (video2world pre-trained)
  • Fine-tuning data: GR1 dual-arm robot teleoperation (LeRobot format)
  • Action dimension: 29 (left_arm:7 + left_hand:6 + right_arm:7 + right_hand:6 + waist:3)
  • Video frames: 13 frames per chunk (1 conditional + 12 predicted)
  • Temporal compression: 4x (13 image frames compress to 4 latent frames)
  • Resolution: 480x832
  • Network: cosmos_v1_14B_action_chunk_conditioned (36 blocks, 5120 channels, 40 heads)
  • Checkpoint format: EMA weights in bf16 (model_ema_bf16.pt, ~28GB)
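The action and frame dimensions above can be sanity-checked in a few lines. This is an illustrative sketch; the constant and helper names are not part of the codebase:

```python
# Sanity checks for the dimensions listed above; names are illustrative.

# 29-D action vector = left_arm(7) + left_hand(6) + right_arm(7) + right_hand(6) + waist(3)
ACTION_PARTS = {"left_arm": 7, "left_hand": 6, "right_arm": 7, "right_hand": 6, "waist": 3}
ACTION_DIM = sum(ACTION_PARTS.values())

def num_latent_frames(pixel_frames: int, temporal_compression: int = 4) -> int:
    """The single conditional frame is kept as-is; the remaining frames compress 4x."""
    return 1 + (pixel_frames - 1) // temporal_compression

assert ACTION_DIM == 29          # matches the model's action dimension
assert num_latent_frames(13) == 4  # 13 image frames -> 4 latent frames
```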

Training Configuration

Parameter          Value
GPUs               8x H200 (140GB)
Batch size         2 per GPU (global = 16)
Learning rate      4e-5
Max iterations     4,000+
Optimizer          AdamW (cosine schedule)
Precision          bf16
Episodes per task  3 (uniform sampling)
FSDP shard size    32
VAE                Wan2.1
Text encoder       Cosmos-Reason1-7B
Tokenizer          Qwen2.5-VL-7B-Instruct

Inference

Prerequisites

This model requires the Cosmos-Predict2.5 codebase and the following dependencies:

  • Wan2.1 VAE: Wan2.1_VAE.pth (14B model includes its own copy)
  • Text encoder: Cosmos-Reason1-7B
  • Tokenizer: Qwen2.5-VL-7B-Instruct

Quick Start

python cosmos_predict2/_src/predict2/action/inference/inference_gr00t.py \
  --experiment=cosmos_predict2p5_2B_action_conditioned_gr00t_gr1_customized_13frame_full_16nodes_release_oss \
  --ckpt_path=/path/to/model_ema_bf16.pt \
  --input_video_root=/path/to/eval_data \
  --save_root=/path/to/output \
  --resolution 480,832 \
  --guidance 0 \
  --chunk_size 12 \
  --fps_downsample_ratio 2 \
  --save_fps 10 \
  --vae_path /path/to/Wan2.1_VAE.pth \
  --text_encoder_path /path/to/Cosmos-Reason1-7B \
  --qwen_path /path/to/Qwen2.5-VL-7B-Instruct \
  --experiment_opts net=cosmos_v1_14B_action_chunk_conditioned model.config.fsdp_shard_size=32

Note: --experiment passes the registered 2B experiment name for Hydra config lookup; --experiment_opts then overrides the network architecture to 14B.

Input Data Format

The inference script expects a directory containing paired files:

  • episode_XXXXXX.mp4 - Input video (first frame used as conditional)
  • episode_XXXXXX_actions.npy - Action array of shape (T, 29) in numpy format
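The pairing convention above can be checked before launching inference. A minimal sketch, assuming only the directory layout described here (the `validate_eval_dir` helper is not part of the codebase):

```python
# Illustrative: verify each episode_XXXXXX.mp4 has a matching
# episode_XXXXXX_actions.npy with a (T, 29) action array.
from pathlib import Path

import numpy as np


def validate_eval_dir(root: str, action_dim: int = 29) -> list[str]:
    """Return the episode stems found in root, raising on missing/malformed pairs."""
    episodes = []
    for video in sorted(Path(root).glob("episode_*.mp4")):
        actions_path = video.with_name(f"{video.stem}_actions.npy")
        if not actions_path.exists():
            raise FileNotFoundError(f"missing actions for {video.name}")
        actions = np.load(actions_path)
        if actions.ndim != 2 or actions.shape[1] != action_dim:
            raise ValueError(
                f"{actions_path.name}: expected shape (T, {action_dim}), got {actions.shape}"
            )
        episodes.append(video.stem)
    return episodes
```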

Eval Data Preparation

Use scripts/prepare_gr1_eval_data.py to extract eval episodes from a LeRobot dataset:

python scripts/prepare_gr1_eval_data.py \
  --dataset-path /path/to/GR1_robot \
  --output-dir /path/to/eval_data \
  --num-episodes 100

One-command Eval

bash scripts/eval_gr1_robot.sh /path/to/model_ema_bf16.pt --debug

The eval script auto-detects 14B vs 2B from the checkpoint path and applies the correct network config.
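A hedged sketch of how such path-based detection could look. This mirrors the behavior described above but is not the script's actual logic, and the 2B network name is an assumption by analogy with the 14B one:

```python
# Illustrative only: infer the network config from the checkpoint path.
# The 2B network name below is assumed, not confirmed by the model card.
def detect_network(ckpt_path: str) -> str:
    if "14b" in ckpt_path.lower():
        return "cosmos_v1_14B_action_chunk_conditioned"
    return "cosmos_v1_2B_action_chunk_conditioned"
```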

Examples

File Structure

.
β”œβ”€β”€ model_ema_bf16.pt           # Model weights (EMA, bf16, ~28GB)
β”œβ”€β”€ inference/
β”‚   β”œβ”€β”€ inference_gr00t.py      # Main inference script
β”‚   └── inference_pipeline.py   # Inference pipeline (ActionVideo2WorldInference)
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ eval_gr1_robot.sh       # One-command eval script (auto 2B/14B)
β”‚   β”œβ”€β”€ prepare_gr1_eval_data.py # Eval data preparation
β”‚   └── convert_distcp_to_pt.py # DCP -> PT checkpoint converter
└── examples/
    β”œβ”€β”€ predicted/              # Predicted videos
    └── gt/                     # Ground truth videos

Limitations

  • Trained only on GR1 dual-arm robot data; may not generalize to other embodiments
  • Prediction quality degrades for long-horizon generation (many chunks)
  • Episodes with fewer than 12 actions cannot be processed (chunk_size=12)
  • 14B model requires ~60GB GPU memory for inference (single GPU)
  • Works best without classifier-free guidance (guidance=0)
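The chunk-size constraint above follows from how actions are consumed 12 at a time. A sketch of the arithmetic, assuming trailing partial chunks are dropped (the helper name is illustrative):

```python
def num_chunks(num_actions: int, chunk_size: int = 12) -> int:
    """Number of 12-action prediction chunks; episodes shorter than one chunk fail."""
    if num_actions < chunk_size:
        raise ValueError(f"need >= {chunk_size} actions, got {num_actions}")
    return num_actions // chunk_size  # assumes any trailing remainder is dropped
```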

License

Apache-2.0 (same as base Cosmos-Predict2.5)
