Based on the paper *TinyStories: How Small Can Language Models Be and Still Speak Coherent English?* (arXiv:2305.07759).
A GPT-style small language model (~16M parameters) built entirely from scratch using PyTorch and trained on the TinyStories dataset.
This model was built as a learning project to understand transformer architecture, training loops, gradient accumulation, and mixed-precision training from the ground up.
| Property | Value |
|---|---|
| Architecture | GPT-style decoder-only transformer |
| Parameters | ~16M |
| Layers | 4 transformer blocks |
| Attention heads | 4 |
| Embedding dimension | 256 |
| Context window | 64 tokens |
| Vocabulary | GPT-2 BPE (50,257 tokens) |
| Training data | TinyStories (50k stories subset) |
| Training iterations | 3,000 |
| Optimizer | AdamW (beta1=0.9, beta2=0.95, wd=0.1) |
| LR schedule | Linear warmup -> Cosine decay |
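The warmup-plus-cosine LR schedule from the table can be sketched as follows. This is a minimal illustration of the standard scheme; the `max_lr`, `min_lr`, and `warmup_iters` values are assumptions for demonstration, not the exact hyperparameters used in training:

```python
import math

def get_lr(it, max_lr=6e-4, min_lr=6e-5, warmup_iters=100, max_iters=3000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if it < warmup_iters:
        # Linear warmup: ramp from ~0 up to max_lr
        return max_lr * (it + 1) / warmup_iters
    if it >= max_iters:
        return min_lr
    # Cosine decay: coeff goes 1 -> 0 as it goes warmup_iters -> max_iters
    ratio = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))
    return min_lr + coeff * (max_lr - min_lr)
```

Calling `get_lr(it)` each iteration and assigning the result to every optimizer param group reproduces the schedule in the table.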
```bash
pip install transformers torch huggingface_hub
```
```python
import torch
from transformers import GPT2Tokenizer
from modeling_slm import SLMModel

# Load the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("YOUR_HF_USERNAME/slm-tinystories")
model = SLMModel.from_pretrained("YOUR_HF_USERNAME/slm-tinystories",
                                 trust_remote_code=True)
model.eval()

# Generate a story continuation
prompt = "Once upon a time there was a little girl"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():  # inference only; no gradients needed
    output = model.generate(inputs["input_ids"], max_new_tokens=200,
                            temperature=1.0, top_k=50)
print(tokenizer.decode(output.squeeze().tolist(), skip_special_tokens=True))
```
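For reference, the `temperature` and `top_k` arguments above typically control sampling like this. This is a sketch of standard top-k sampling, not necessarily the model's exact `generate` implementation:

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=50):
    """Sample one token id from a (batch, vocab) tensor of logits."""
    # Lower temperature sharpens the distribution, higher flattens it
    logits = logits / temperature
    # Mask out everything below the k-th largest logit
    v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
    logits[logits < v[..., [-1]]] = -float("inf")
    # Renormalize and draw a sample
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```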
The model was trained using:

- AdamW optimizer (beta1=0.9, beta2=0.95, weight decay 0.1)
- Linear warmup followed by cosine decay of the learning rate
- Gradient accumulation to simulate larger batch sizes
- Mixed-precision training
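The gradient-accumulation and mixed-precision setup mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual training loop: `accum_steps`, the `get_batch` function, the `(logits, loss)` return convention, and the clipping value are all assumptions:

```python
import torch

def train_step(model, optimizer, scaler, get_batch,
               accum_steps=4, device="cuda", dtype=torch.float16):
    """One optimizer step accumulated over several micro-batches under AMP."""
    optimizer.zero_grad(set_to_none=True)
    for _ in range(accum_steps):
        x, y = get_batch()  # input token ids and shifted targets
        with torch.autocast(device_type=device, dtype=dtype):
            _, loss = model(x, y)      # assumed to return (logits, loss)
            loss = loss / accum_steps  # average the loss over micro-batches
        scaler.scale(loss).backward()  # scaled backward pass for fp16
    scaler.unscale_(optimizer)         # unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)             # skips the step if grads overflowed
    scaler.update()
```

Accumulating over `accum_steps` micro-batches gives the gradient of an `accum_steps`-times-larger effective batch while keeping per-step memory constant.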