Trelis/nanochat-recursive

RonanMcGovern committed on Dec 17, 2025

Commit b84ec1c · verified · 1 Parent(s): 03d871c

add wandb and gsm8k plot

Files changed (1)

README.md +6 -1

@@ -11,6 +11,7 @@ About Recursive NanoChat:
 - This model is trained on the same amount of data and using roughly the same number of flops as a [non-recursive 20 layer nanochat model](https://huggingface.co/Trelis/nanochat).
 - It has roughly half of the original model parameters.
 - See the `recursive` branch of [https://github.com/TrelisResearch/nanochat]() for details on how to download and run.
+- See full logs on [Weights and Biases](https://wandb.ai/trelis/nanochat).
 ## Design Choices
 - (P, R, C) = (2, 4, 2) → 8 unique layer weights # 2 prelude layers, 4 recursive and 2 coda
@@ -40,4 +41,8 @@ About Recursive NanoChat:
 ![ChatCORE vs Recursions](chatcore_vs_recursions.png)
-*The recursive model (8 unique layers, ~328M params) approaches d20 performance (20 unique layers, ~561M params) as test-time recurrences increase. At r=4 (iso-flops), the recursive model achieves 94% of d20's ChatCORE with 42% fewer parameters.*
+*The recursive model (8 unique layers, ~328M params) approaches d20 performance (20 unique layers, ~561M params) as test-time recurrences increase. At r=4 (iso-flops), the recursive model achieves 94% of d20's ChatCORE with 42% fewer parameters.*
+![GSM8K vs Recursions](gsm8k_vs_recursions.png)
+*On GSM8K (math reasoning), the recursive model surpasses d20 at r>=4, suggesting that iterative refinement through recurrence may particularly benefit reasoning tasks.*
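
The README lines in this diff describe a (P, R, C) = (2, 4, 2) layout and call r=4 "iso-flops" with the 20-layer d20 model. A minimal sketch of that arithmetic follows, assuming the recursive block is simply reapplied r times between the prelude and the coda. The function names are hypothetical, and the actual implementation on the `recursive` branch may differ (for example, in how state is passed between recurrences).

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
Layer = Callable[[T], T]


def unique_layers(prelude: int, recursive: int, coda: int) -> int:
    """Number of distinct layer weight sets stored in the model."""
    return prelude + recursive + coda


def effective_depth(prelude: int, recursive: int, coda: int, recurrences: int) -> int:
    """Layer applications per forward pass when the recursive block runs `recurrences` times."""
    return prelude + recursive * recurrences + coda


def forward(
    x: T,
    prelude_layers: Sequence[Layer],
    recursive_layers: Sequence[Layer],
    coda_layers: Sequence[Layer],
    recurrences: int,
) -> T:
    """Assumed control flow: prelude once, recursive block r times, coda once."""
    for layer in prelude_layers:
        x = layer(x)
    for _ in range(recurrences):
        # The same weights are reused on every pass through this block.
        for layer in recursive_layers:
            x = layer(x)
    for layer in coda_layers:
        x = layer(x)
    return x


P, R, C = 2, 4, 2  # values from the README's Design Choices section
print(unique_layers(P, R, C))                   # 8
print(effective_depth(P, R, C, recurrences=4))  # 20

# Parameter counts are the approximate figures quoted in the plot caption.
param_reduction = 1 - 328 / 561
print(f"{param_reduction:.1%}")                 # 41.5%
```

Under this assumption, r=4 gives 2 + 4×4 + 2 = 20 layer applications, the same depth as d20, which is consistent with the "iso-flops" label. The parameter reduction of about 41.5% agrees with the caption's "42% fewer parameters" given that both counts are rounded.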