SLM β€” Small Language Model trained on TinyStories

A GPT-style small language model (~16M parameters) built entirely from scratch using PyTorch and trained on the TinyStories dataset.

This model was built as a learning project to understand transformer architecture, training loops, gradient accumulation, and mixed-precision training from the ground up.


Model details

Architecture: GPT-style decoder-only transformer
Parameters: ~16M
Layers: 4 transformer blocks
Attention heads: 4
Embedding dimension: 256
Context window: 64 tokens
Vocabulary: GPT-2 BPE (50,257 tokens)
Training data: TinyStories (50k-story subset)
Training iterations: 3,000
Optimizer: AdamW (beta1=0.9, beta2=0.95, weight decay=0.1)
LR schedule: linear warmup -> cosine decay
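The figures above can be sanity-checked with a back-of-the-envelope parameter count. The sketch below uses a hypothetical config (field names are illustrative, not taken from modeling_slm) and ignores biases and LayerNorm weights:

```python
from dataclasses import dataclass

# Hypothetical config mirroring the table above.
@dataclass
class SLMConfig:
    vocab_size: int = 50257  # GPT-2 BPE
    block_size: int = 64     # context window
    n_layer: int = 4
    n_head: int = 4
    n_embd: int = 256

def approx_params(cfg: SLMConfig) -> int:
    """Rough parameter count, ignoring biases and LayerNorms."""
    tok_emb = cfg.vocab_size * cfg.n_embd   # tied with the output head
    pos_emb = cfg.block_size * cfg.n_embd
    attn = 4 * cfg.n_embd ** 2              # q, k, v, out projections
    mlp = 8 * cfg.n_embd ** 2               # two 4x-expansion linears
    return tok_emb + pos_emb + cfg.n_layer * (attn + mlp)

print(approx_params(SLMConfig()))  # ~16.0M, matching the table
```

Note that the tied token embedding alone accounts for ~12.9M of the ~16M parameters; the four transformer blocks contribute only ~3.1M.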

How to use

Install dependencies

pip install transformers torch huggingface_hub

Generate text

import torch
from transformers import GPT2Tokenizer
from modeling_slm import SLMModel

# Load tokenizer and model (trust_remote_code is required because the
# architecture is custom)
tokenizer = GPT2Tokenizer.from_pretrained("YOUR_HF_USERNAME/slm-tinystories")
model = SLMModel.from_pretrained("YOUR_HF_USERNAME/slm-tinystories",
                                 trust_remote_code=True)
model.eval()

# Generate (no_grad avoids building an unneeded autograd graph)
prompt = "Once upon a time there was a little girl"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(inputs["input_ids"], max_new_tokens=200,
                            temperature=1.0, top_k=50)
print(tokenizer.decode(output.squeeze().tolist(), skip_special_tokens=True))

Training approach

The model was trained using:

  • Gradient accumulation (8 steps) to simulate an effective batch size of 128 sequences without needing that much GPU memory at once
  • Mixed precision (bfloat16) for faster training and lower memory usage
  • Gradient clipping (max norm 0.5) for training stability
  • Weight tying between the token embedding matrix and the output projection layer — removes a duplicate vocab-by-embedding matrix (~12.9M parameters) and typically helps small models learn faster
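The pieces above fit together in the inner training loop. The sketch below is illustrative, not the actual training script: the micro-batch size, warmup length, and peak learning rate are assumptions (only accumulation steps, clipping norm, betas, weight decay, and total iterations come from the card), and a toy linear model stands in for the transformer so the loop runs as-is:

```python
import math
import torch
import torch.nn as nn

accum_steps  = 8       # from the card: 8 accumulation steps
micro_batch  = 16      # assumed; 16 x 8 = effective batch of 128
max_iters    = 3000
warmup_iters = 100     # assumed; not stated on the card
max_lr       = 3e-4    # assumed
clip_norm    = 0.5

def lr_at(it: int) -> float:
    """Linear warmup, then cosine decay to ~0."""
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    return 0.5 * max_lr * (1 + math.cos(math.pi * progress))

# Toy stand-in model so the loop below is runnable as-is.
model = nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr,
                              betas=(0.9, 0.95), weight_decay=0.1)

for it in range(2):  # a couple of iterations for illustration
    for g in optimizer.param_groups:
        g["lr"] = lr_at(it)
    for _ in range(accum_steps):
        x = torch.randn(micro_batch, 8)
        # bfloat16 autocast (use "cuda" on GPU)
        with torch.autocast("cpu", dtype=torch.bfloat16):
            loss = model(x).pow(2).mean()
        # Divide so accumulated gradients average over the full batch
        (loss / accum_steps).backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```

Note the division by accum_steps before backward(): without it, accumulated gradients would be 8x too large and the effective learning rate would silently change.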

Limitations

  • Context window of 64 tokens is short β€” long prompts will be truncated
  • Trained on a small 50k-story subset, so vocabulary and topics are limited
  • Output quality is intentionally modest β€” this is a from-scratch learning project, not a production model
  • Not suitable for any real-world application

What I learned building this

  • How transformer blocks work internally (attention, MLP, residual connections, layer norm)
  • How gradient accumulation simulates large batch training on limited hardware
  • How the training loop, optimizer, and LR scheduler fit together
  • How to package a custom PyTorch model for HuggingFace Hub
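The attention/MLP/residual/layer-norm pattern from the first bullet can be sketched as a minimal pre-norm GPT block. This is the general shape, not the actual modeling_slm implementation (which likely hand-rolls attention rather than using nn.MultiheadAttention):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Minimal pre-norm decoder block: LayerNorm -> causal attention
    -> residual, then LayerNorm -> MLP -> residual."""
    def __init__(self, d: int = 256, heads: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(
            nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # Boolean causal mask: True above the diagonal = blocked
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), 1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + a                       # residual around attention
        x = x + self.mlp(self.ln2(x))   # residual around MLP
        return x

y = Block()(torch.randn(2, 8, 256))  # (batch, seq, d) in, same shape out
```

The residual connections keep the input and output shapes identical, which is what lets blocks stack cleanly.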
