tinystories-gpt2-124M-scratch
A 124M-parameter GPT-2 style language model trained from scratch on the TinyStories dataset. The model code does not use the Hugging Face transformers library: every component (attention, layer norm, feed-forward, positional embeddings) was written in raw PyTorch.
Built as part of a hands-on deep dive into how large language models actually work, following Sebastian Raschka's Build a Large Language Model (From Scratch).
Model Details
| Property | Value |
|---|---|
| Architecture | Decoder-only Transformer (GPT-2 style) |
| Parameters | ~124M (with weight tying) |
| Layers | 12 Transformer blocks |
| Attention heads | 12 |
| Embedding dim | 768 |
| Context length | 128 tokens |
| Vocabulary | 50,257 (GPT-2 BPE via tiktoken) |
| Training data | 20,000 TinyStories samples |
| Epochs | 10 |
| Final train loss | ~1.86 |
| Final val loss | ~2.05 |
| Hardware | NVIDIA T4 (Google Colab) |
| Precision | FP16 mixed precision |
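The ~124M figure can be checked with plain arithmetic. The sketch below is a back-of-the-envelope count, not the author's code. It assumes the layer layout from Raschka's book: bias-free query/key/value projections (matching the `qkv_bias: False` setting in this card's config), an attention output projection with bias, a 4x feed-forward expansion, LayerNorm with scale and shift, and a bias-free output head. The function name `count_params` is hypothetical.

```python
def count_params(vocab_size=50257, context_length=128, emb_dim=768,
                 n_layers=12, tie_weights=True):
    """Count trainable parameters for a GPT-2 style decoder (assumed layout)."""
    tok_emb = vocab_size * emb_dim               # token embedding table
    pos_emb = context_length * emb_dim           # learned positional embeddings
    qkv = 3 * emb_dim * emb_dim                  # W_q, W_k, W_v without bias
    attn_out = emb_dim * emb_dim + emb_dim       # output projection with bias
    hidden = 4 * emb_dim                         # feed-forward hidden width
    ff = (emb_dim * hidden + hidden) + (hidden * emb_dim + emb_dim)
    norms = 2 * (2 * emb_dim)                    # two LayerNorms per block
    block = qkv + attn_out + ff + norms
    final_norm = 2 * emb_dim
    # With weight tying the output head reuses the token embedding matrix.
    out_head = 0 if tie_weights else vocab_size * emb_dim
    return tok_emb + pos_emb + n_layers * block + final_norm + out_head


print(f"tied:   {count_params(tie_weights=True):,}")
print(f"untied: {count_params(tie_weights=False):,}")
```

Under these assumptions the tied count is 123,724,032 (about 124M) and the untied count is 162,321,408 (about 162M).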
Training
The model was trained using PyTorch Lightning with the following setup:
- Optimizer: AdamW (lr=3e-4, weight_decay=0.1, betas=(0.9, 0.95))
- Gradient clipping: 1.0
- Batch size: 8
- Stride: 128 (non-overlapping chunks)
- Dataset: roneneldan/TinyStories (20k train / 5k val stories)
Loss dropped from ~21.5 at step 100 to ~1.86 by the end of epoch 10.
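With a stride equal to the context length, each training example covers a fresh, non-overlapping window of tokens. The following is a minimal sketch of that chunking on a plain list of token IDs. `make_chunks` is a hypothetical helper, and shifting the targets by one token is the usual next-token setup, assumed here rather than taken from the training code.

```python
def make_chunks(token_ids, context_length=128, stride=128):
    """Split a token ID list into (input, target) windows for next-token training."""
    inputs, targets = [], []
    # Stop early enough that every target window has a full context_length tokens.
    for start in range(0, len(token_ids) - context_length, stride):
        inputs.append(token_ids[start:start + context_length])
        # Targets are the same window shifted one token to the right.
        targets.append(token_ids[start + 1:start + context_length + 1])
    return inputs, targets


ids = list(range(300))          # stand-in for real token IDs
x, y = make_chunks(ids)
print(len(x), x[1][0], y[1][0])
```

A stride smaller than the context length would produce overlapping windows instead, at the cost of more examples per story.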
Architecture
The model is a clean decoder-only transformer built from scratch:
Token Embedding → Positional Embedding → Dropout
→ 12x [LayerNorm → Masked Multi-Head Attention → Dropout + Residual
→ LayerNorm → Feed-Forward (GELU) → Dropout + Residual]
→ Final LayerNorm → Linear Output Head
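The GELU in the feed-forward step is commonly computed in GPT-2 with a tanh approximation. As a hedged, pure-Python sketch on scalar inputs (the model itself would apply this elementwise to tensors; the function names are illustrative and not taken from this repository):

```python
import math


def gelu_tanh(x):
    """Tanh approximation of GELU, as used in GPT-2 style models."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_exact(x):
    """Exact GELU via the Gaussian error function, for comparison."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for v in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"{v:+.1f}  approx={gelu_tanh(v):.4f}  exact={gelu_exact(v):.4f}")
```

The two curves stay very close over this range, which is why the approximation is a common drop-in.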
Key implementation details:
- Pre-LayerNorm (more stable than post-LayerNorm)
- Causal masking via upper-triangular matrix
- Weight tying between token embedding and output head (reduces params from ~162M to ~124M)
- Custom GELU approximation matching GPT-2's implementation
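To illustrate the causal masking bullet: entries above the diagonal of the attention score matrix are set to negative infinity before the softmax, so each position attends only to itself and earlier positions. A minimal pure-Python sketch for a single head follows (the model would do this with tensors; `causal_softmax` is a hypothetical name).

```python
import math


def causal_softmax(scores):
    """Apply an upper-triangular causal mask, then a row-wise softmax."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Positions j > i are in the future and get masked out.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        row_max = max(masked)                        # subtract max for stability
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [[0.0] * 4 for _ in range(4)]               # uniform raw scores
attn = causal_softmax(scores)
for row in attn:
    print([round(w, 2) for w in row])
```

With uniform scores, row i spreads its weight evenly over the first i + 1 positions and assigns zero to everything after.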
Sample Output
After 10 epochs, the model generates coherent short stories:
Prompt: "Tom and his friend were playing"
Output:
Tom and his friend were playing with the ball. They were having fun until they heard a loud noise. It was a big dog. The dog was barking and running towards them. "Hey, that's our ball!" Tom said. "Give it back!"
Use in Transformers
First, ensure you have the transformers and torch libraries installed:
pip install transformers torch
⚠️ Important: Because this model uses a custom architecture wrapper (hf_model.py), you must pass trust_remote_code=True when loading the model.
Option 1: Text Generation Pipeline (Recommended)
You can easily use this model for text generation via the Hugging Face pipeline:
import torch
from transformers import pipeline

# Set up the text generation pipeline
generator = pipeline(
    "text-generation",
    model="snehangshu511/tinystories-gpt2-124M-scratch",
    trust_remote_code=True,  # Required for custom architecture
    device="cuda" if torch.cuda.is_available() else "cpu",
)

# Generate a story!
prompt = "Tom and his friend were playing"
output = generator(
    prompt,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_k=50,
)
print(output[0]["generated_text"])
Option 2: Manual Loading (AutoModelForCausalLM)
If you want more control over the tokenization and decoding process, you can load the model and tokenizer directly using transformers:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "snehangshu511/tinystories-gpt2-124M-scratch"

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Load Model
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,  # Required for custom architecture
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Prepare input
input_text = "A boy named Max had a big red ball"
inputs = tokenizer(input_text, return_tensors="pt").to(device)

# Generate
with torch.no_grad():
    output_ids = model.generate(
        inputs.input_ids,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.8,
        top_k=50,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode and print output
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)
Intended Use & Limitations
Inference
import torch
import tiktoken

# GPTModel and generate_text are not defined in this snippet:
# define or import them first (for example from the training code).
cfg = {
    'vocab_size': 50257,
    'context_length': 128,
    'emb_dim': 768,
    'n_heads': 12,
    'n_layers': 12,
    'drop_rate': 0.1,
    'qkv_bias': False
}

# Load model weights on CPU and switch to eval mode (disables dropout)
model = GPTModel(cfg)
model.load_state_dict(
    torch.load("tinystories-gpt2-124M-scratch.pt", map_location="cpu")
)
model.eval()
tokenizer = tiktoken.get_encoding("gpt2")

# Generate text
def generate(prompt, max_new_tokens=100, temperature=0.8, top_k=40):
    # Encode the prompt and add a batch dimension
    encoded = torch.tensor(
        tokenizer.encode(prompt, allowed_special={"<|endoftext|>"})
    ).unsqueeze(0)
    token_ids = generate_text(
        model=model,
        idx=encoded,
        max_new_tokens=max_new_tokens,
        context_size=cfg["context_length"],
        temperature=temperature,
        top_k=top_k
    )
    # Drop the batch dimension and decode back to text
    return tokenizer.decode(token_ids.squeeze(0).tolist())

print(generate("Once upon a time, there was a little girl named Lily"))
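generate_text is not shown in this card. As a rough sketch of what its temperature and top_k arguments typically control for a single decoding step, here is a pure-Python version. This is an assumption about its behavior, and `sample_next_token` is a hypothetical helper, not a function from this repository.

```python
import math
import random


def sample_next_token(logits, temperature=0.8, top_k=40, rng=random):
    """Pick one token index using top-k filtering and temperature scaling."""
    k = min(top_k, len(logits))
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Lower temperature sharpens the distribution, higher flattens it.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]      # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(top, weights=probs, k=1)[0]


logits = [2.0, 0.5, -1.0, 3.0, 0.0]                  # toy 5-token vocabulary
print(sample_next_token(logits, temperature=0.8, top_k=2))
```

With top_k=1 this reduces to greedy decoding, which is one reason very low temperatures or small top_k values tend to produce repetitive text.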
Good Prompts to Try
This model works best with TinyStories-style prompts:
- "Once upon a time, there was a little girl named Lily"
- "Tom and his friend were playing"
- "One day, a small dog found a"
- "Sara and her mom went to the park"
- "The little rabbit was very hungry"
- "In a small house near the forest"
Files in this Repository
| File | Description |
|---|---|
| pytorch_model.bin | The trained model weights in Hugging Face format |
| config.json | The custom model configuration (includes auto-mapping) |
| hf_model.py | The custom model architecture required to load the weights |
| tokenizer.json & tokenizer_config.json | GPT-2 tokenizer files used for text encoding/decoding |
| README.md | This model card |
Limitations
- Trained on simple children's stories only — not suitable for general Q&A or complex reasoning
- Context window of 128 tokens — outputs may lose coherence beyond that
- Will repeat phrases occasionally, especially at lower temperatures
- Not instruction-tuned — it completes text, it does not answer questions
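The 128-token limit means that during long generations only the most recent tokens can be fed back to the model. A minimal sketch of that cropping, assuming the generation loop keeps the last `context_length` token IDs (`crop_context` is a hypothetical helper, not a function from this repository):

```python
def crop_context(token_ids, context_length=128):
    """Keep only the most recent context_length token IDs."""
    return token_ids[-context_length:]


history = list(range(200))      # stand-in for 200 generated token IDs
window = crop_context(history)
print(len(window), window[0], window[-1])
```

Anything earlier than the window is invisible to the model at that step, which is why long outputs can drift away from the start of the story.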
Author
Snehangshu Bhuin
- GitHub: snehangshu2002
- Reference: Build a Large Language Model (From Scratch) by Sebastian Raschka