add wandb and gsm8k plot
README.md
CHANGED
@@ -11,6 +11,7 @@ About Recursive NanoChat:
 - This model is trained on the same amount of data and using roughly the same number of flops as a [non-recursive 20 layer nanochat model](https://huggingface.co/Trelis/nanochat).
 - It has roughly half of the original model parameters.
 - See the `recursive` branch of [https://github.com/TrelisResearch/nanochat]() for details on how to download and run.
+- See full logs on [Weights and Biases](https://wandb.ai/trelis/nanochat).
 
 ## Design Choices
 - (P, R, C) = (2, 4, 2) → 8 unique layer weights # 2 prelude layers, 4 recursive and 2 coda
@@ -40,4 +41,8 @@ About Recursive NanoChat:
 
 
 
-*The recursive model (8 unique layers, ~328M params) approaches d20 performance (20 unique layers, ~561M params) as test-time recurrences increase. At r=4 (iso-flops), the recursive model achieves 94% of d20's ChatCORE with 42% fewer parameters.*
+*The recursive model (8 unique layers, ~328M params) approaches d20 performance (20 unique layers, ~561M params) as test-time recurrences increase. At r=4 (iso-flops), the recursive model achieves 94% of d20's ChatCORE with 42% fewer parameters.*
+
+
+
+*On GSM8K (math reasoning), the recursive model surpasses d20 at r>=4, suggesting that iterative refinement through recurrence may particularly benefit reasoning tasks.*
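Review note: the figures quoted in the README text can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, assuming that each test-time recurrence applies the 4 recursive layers once more between the prelude and the coda (the diff does not spell this out). The function names are illustrative and are not taken from the nanochat repository; the parameter counts are the approximate values quoted in the caption.

```python
def unique_layers(prelude: int, recursive: int, coda: int) -> int:
    """Number of distinct layer weight sets for a (P, R, C) design."""
    return prelude + recursive + coda


def effective_depth(prelude: int, recursive: int, coda: int, recurrences: int) -> int:
    """Layers applied per forward pass, ASSUMING the recursive block is
    repeated `recurrences` times between prelude and coda."""
    return prelude + recursive * recurrences + coda


def param_reduction(small: float, large: float) -> float:
    """Fraction of parameters saved by the smaller model."""
    return 1.0 - small / large


P, R, C = 2, 4, 2

print(unique_layers(P, R, C))                   # 8
print(effective_depth(P, R, C, recurrences=4))  # 20
print(f"{param_reduction(328e6, 561e6):.1%}")   # 41.5%
```

Under that assumption, r=4 unrolls to 20 applied layers, which is consistent with the caption calling r=4 "iso-flops" relative to the 20-layer d20 model. The ~41.5% reduction rounds to the quoted "42% fewer parameters", which is somewhat less than the "roughly half" stated earlier in the README.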