Reasoning-Trajectory Misalignment: Is an RL-aligned checkpoint planned?
#11, opened by dedarrow
I've been testing Alpamayo 1 extensively with AlpaSim and found that the Chain-of-Causation reasoning frequently contradicts the actual trajectory output: the reasoning says "nudge left to pass parked car," but the trajectory curves right, causing collisions.
Reproduced on:
- DGX Spark (ARM64)
- 4x H100 (x86)
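For anyone who wants to check their own rollouts for the same mismatch, here is a rough sketch of how it could be flagged automatically. It is not part of Alpamayo or AlpaSim: the function names, the ego-frame convention (x forward, y positive to the left), and the 0.3 m threshold are all assumptions made for illustration, and the keyword match on "left"/"right" is a crude stand-in for properly parsing the reasoning text.

```python
import re


def stated_direction(reasoning):
    """Return 'left' or 'right' if the text names exactly one of them, else None."""
    found = set(re.findall(r"\b(left|right)\b", reasoning.lower()))
    return found.pop() if len(found) == 1 else None


def trajectory_direction(waypoints, min_offset_m=0.3):
    """Classify the lateral direction of a trajectory.

    waypoints: list of (x, y) tuples in an assumed ego frame,
    x forward and y positive to the left, in meters.
    Returns None when the peak lateral offset is below min_offset_m.
    """
    if not waypoints:
        return None
    # Lateral offset with the largest magnitude, sign preserved.
    peak = max((y for _, y in waypoints), key=abs)
    if abs(peak) < min_offset_m:
        return None
    return "left" if peak > 0 else "right"


def is_contradiction(reasoning, waypoints):
    """True when the text and the trajectory name opposite lateral directions."""
    said = stated_direction(reasoning)
    did = trajectory_direction(waypoints)
    return said is not None and did is not None and said != did
```

Running `is_contradiction` over each (reasoning, trajectory) pair in a batch of rollouts gives a rough count of how often the two disagree. Ambiguous text or near-straight trajectories are skipped rather than counted.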
The GitHub FAQ confirms this release is SFT-only, without RL post-training. Per the paper (arXiv:2511.00088), RL post-training improves "reasoning-action consistency by 37%", which appears to be exactly what's missing.
Questions:
- Is this expected behavior for the SFT-only release?
- Is there a timeline for releasing the RL-aligned checkpoint?
Detailed findings with video evidence:
https://github.com/NVlabs/alpasim/issues/20
https://github.com/NVlabs/alpamayo/issues/38