RonanMcGovern committed on
Commit
b84ec1c
·
verified ·
1 Parent(s): 03d871c

add wandb and gsm8k plot

Files changed (1)
  1. README.md +6 -1
README.md CHANGED
@@ -11,6 +11,7 @@ About Recursive NanoChat:
  - This model is trained on the same amount of data and using roughly the same number of flops as a [non-recursive 20 layer nanochat model](https://huggingface.co/Trelis/nanochat).
  - It has roughly half of the original model parameters.
  - See the `recursive` branch of [https://github.com/TrelisResearch/nanochat]() for details on how to download and run.
+ - See full logs on [Weights and Biases](https://wandb.ai/trelis/nanochat).
 
  ## Design Choices
  - (P, R, C) = (2, 4, 2) → 8 unique layer weights # 2 prelude layers, 4 recursive and 2 coda
@@ -40,4 +41,8 @@ About Recursive NanoChat:
 
  ![ChatCORE vs Recursions](chatcore_vs_recursions.png)
 
- *The recursive model (8 unique layers, ~328M params) approaches d20 performance (20 unique layers, ~561M params) as test-time recurrences increase. At r=4 (iso-flops), the recursive model achieves 94% of d20's ChatCORE with 42% fewer parameters.*
+ *The recursive model (8 unique layers, ~328M params) approaches d20 performance (20 unique layers, ~561M params) as test-time recurrences increase. At r=4 (iso-flops), the recursive model achieves 94% of d20's ChatCORE with 42% fewer parameters.*
+
+ ![GSM8K vs Recursions](gsm8k_vs_recursions.png)
+
+ *On GSM8K (math reasoning), the recursive model surpasses d20 at r>=4, suggesting that iterative refinement through recurrence may particularly benefit reasoning tasks.*
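
Reviewer note: the figures quoted in the README captions can be sanity-checked with a little arithmetic. The sketch below is not taken from the repository. It assumes the effective depth of a forward pass is `P + R * r + C`, that is, the recursive block is applied `r` times between the prelude and the coda. The function names are hypothetical; the layer counts and parameter counts are the approximate values stated in the README.

```python
# Sketch only: the depth formula is an assumption, not the repository's code.
P, R, C = 2, 4, 2  # prelude, recursive, and coda layers, as listed in the README


def unique_layers(p: int, r: int, c: int) -> int:
    """Number of distinct layer weight sets."""
    return p + r + c


def effective_depth(p: int, r: int, c: int, recurrences: int) -> int:
    """Layers executed per forward pass, assuming the recursive block
    runs `recurrences` times (assumed formula)."""
    return p + r * recurrences + c


# Approximate parameter counts quoted in the README caption.
recursive_params = 328e6
d20_params = 561e6
param_reduction = 1 - recursive_params / d20_params

print(unique_layers(P, R, C))       # 8
print(effective_depth(P, R, C, 4))  # 20
print(round(param_reduction * 100)) # 42
```

Under that assumption, r=4 gives 20 executed layers, which is consistent with the caption calling r=4 the iso-flops point against the 20-layer d20 model, and 328M versus 561M parameters works out to roughly 42% fewer.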