Reasoning-Trajectory Misalignment: Is an RL-aligned checkpoint planned?

#11

by dedarrow - opened 2 days ago

I've been extensively testing Alpamayo 1 with AlpaSim and found that the Chain-of-Causation reasoning frequently contradicts the actual trajectory output. For example, the reasoning says "nudge left to pass parked car," but the trajectory curves right, causing collisions.
Reproduced on:

DGX Spark (ARM64)
4x H100 (x86)
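
For anyone who wants to flag this kind of mismatch automatically, here is a minimal sketch of a consistency check. It is not part of Alpamayo or AlpaSim: the function names are hypothetical, the keyword matching is deliberately crude, and it assumes the trajectory is available as (x, y) waypoints in an ego frame where +y points left. If your frame uses the opposite sign, flip the comparison.

```python
import re

def stated_direction(reasoning):
    """Return 'left' or 'right' from the first direction word in the text, else None."""
    match = re.search(r"\b(left|right)\b", reasoning.lower())
    return match.group(1) if match else None

def trajectory_direction(waypoints, threshold_m=0.3):
    """Classify lateral motion from the largest lateral offset along the path.

    Assumption: waypoints are (x, y) pairs in an ego frame with +y to the left.
    The peak offset is used rather than the final one, because a nudge
    around an obstacle may return to the original lane position.
    """
    y0 = waypoints[0][1]
    peak = max((p[1] - y0 for p in waypoints), key=abs)
    if peak > threshold_m:
        return "left"
    if peak < -threshold_m:
        return "right"
    return None

def is_consistent(reasoning, waypoints):
    """Flag a contradiction only when both text and trajectory are decisive."""
    said = stated_direction(reasoning)
    did = trajectory_direction(waypoints)
    return said is None or did is None or said == did

# Hypothetical example: text says left, path curves right
reasoning = "nudge left to pass parked car"
curves_right = [(0.0, 0.0), (5.0, -0.4), (10.0, -0.9)]
print(is_consistent(reasoning, curves_right))
```

The 0.3 m threshold is an arbitrary placeholder and would need tuning against real rollouts.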

The GitHub FAQ confirms this release is SFT-only, without RL post-training. Per the paper (arXiv:2511.00088), RL post-training improves "reasoning-action consistency by 37%", which appears to be exactly what's missing.
Questions:

Is this expected behavior for the SFT-only release?
Is there a timeline for releasing the RL-aligned checkpoint?

Detailed findings with video evidence:
https://github.com/NVlabs/alpasim/issues/20
https://github.com/NVlabs/alpamayo/issues/38
