tinystories-gpt2-124M-scratch

A 124M-parameter GPT-2 style language model trained from scratch on the TinyStories dataset. The model was built and trained without the Hugging Face transformers library: every component (attention, layer norm, feed-forward, positional embeddings) was written in raw PyTorch.

Built as part of a hands-on deep dive into how large language models actually work, following Sebastian Raschka's Build a Large Language Model (From Scratch).

Model Details


Property	Value
Architecture	Decoder-only Transformer (GPT-2 style)
Parameters	~124M (with weight tying)
Layers	12 Transformer blocks
Attention heads	12
Embedding dim	768
Context length	128 tokens
Vocabulary	50,257 (GPT-2 BPE via tiktoken)
Training data	20,000 TinyStories samples
Epochs	10
Final train loss	~1.86
Final val loss	~2.05
Hardware	NVIDIA T4 (Google Colab)
Precision	FP16 mixed precision
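
The ~124M figure can be roughly reproduced from the values in the table. The sketch below is a back-of-the-envelope count, not taken from the training code; it assumes a feed-forward hidden size of 4 × the embedding dimension, no bias on the Q/K/V projections, biases on the remaining linear layers, and an output head without bias.

```python
# Back-of-the-envelope parameter count for the configuration in the table.
# Assumptions (not stated in the table): feed-forward hidden size = 4 * emb_dim,
# no bias on Q/K/V projections, bias on all other linear layers,
# and an output head without bias.
vocab_size = 50257
context_length = 128
emb_dim = 768
n_layers = 12
ff_hidden = 4 * emb_dim

tok_emb = vocab_size * emb_dim
pos_emb = context_length * emb_dim

qkv = 3 * emb_dim * emb_dim                 # W_q, W_k, W_v without bias
attn_out = emb_dim * emb_dim + emb_dim      # output projection with bias
ff = (emb_dim * ff_hidden + ff_hidden) + (ff_hidden * emb_dim + emb_dim)
layer_norms = 2 * (2 * emb_dim)             # two LayerNorms, scale + shift each
per_block = qkv + attn_out + ff + layer_norms

final_norm = 2 * emb_dim
out_head = vocab_size * emb_dim             # shared with tok_emb when weights are tied

untied_total = tok_emb + pos_emb + n_layers * per_block + final_norm + out_head
tied_total = untied_total - out_head

print(f"untied: {untied_total:,}")          # untied: 162,321,408
print(f"tied:   {tied_total:,}")            # tied:   123,724,032
```

Under these assumptions, tying the output head to the token embedding removes one 50,257 × 768 matrix, which accounts for the gap between roughly 162M and 124M parameters.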

Training

The model was trained using PyTorch Lightning with the following setup:

Optimizer: AdamW (lr=3e-4, weight_decay=0.1, betas=(0.9, 0.95))
Gradient clipping: 1.0
Batch size: 8
Stride: 128 (non-overlapping chunks)
Dataset: roneneldan/TinyStories (20k train / 5k val stories)
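
As a rough illustration of the optimizer and clipping settings listed above, the following sketch wires them up in plain PyTorch. The tiny `nn.Linear` model, the MSE loss, and the `training_step` helper are stand-ins invented for this example, and clipping by global norm (rather than by value) is an assumption.

```python
import torch
import torch.nn as nn

# Stand-in model so the sketch runs on its own; the real model is the 124M GPT.
model = nn.Linear(16, 16)

# Hyperparameters taken from the list above.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, weight_decay=0.1, betas=(0.9, 0.95)
)

def training_step(inputs, targets, max_grad_norm=1.0):
    """One optimization step with gradient-norm clipping (hypothetical helper)."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    # Rescale gradients so their global L2 norm is at most max_grad_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()

torch.manual_seed(0)
loss = training_step(torch.randn(8, 16), 100 * torch.randn(8, 16))
print(f"loss: {loss:.2f}")
```

In PyTorch Lightning the same settings would typically live in `configure_optimizers` and the Trainer's `gradient_clip_val` argument; the card does not state whether the original run clipped by norm or by value.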

Training loss dropped from ~21.5 at step 100 to ~1.86 by the end of epoch 10.
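
The stride setting in the training setup determines how the token stream is cut into training examples. A minimal sketch of that chunking is shown below; `make_chunks` is a hypothetical helper written for this example, and the actual data loader may differ in its details.

```python
def make_chunks(token_ids, context_length=128, stride=128):
    """Split one token sequence into (input, target) pairs for next-token prediction.

    The target is the input shifted one position to the right. With
    stride == context_length, consecutive input windows do not overlap.
    """
    pairs = []
    for start in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[start : start + context_length]
        targets = token_ids[start + 1 : start + context_length + 1]
        pairs.append((inputs, targets))
    return pairs

# Toy example with integers standing in for BPE token ids.
tokens = list(range(300))
chunks = make_chunks(tokens)
print(len(chunks))        # 2
print(chunks[1][0][0])    # 128
```

A stride smaller than the context length would produce overlapping windows and more training examples from the same text.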

Architecture

The model is a decoder-only transformer built from scratch. Data flows through it as follows:

Token Embedding → Positional Embedding → Dropout
→ 12x [LayerNorm → Masked Multi-Head Attention → Dropout + Residual
       → LayerNorm → Feed-Forward (GELU) → Dropout + Residual]
→ Final LayerNorm → Linear Output Head
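
One repeated block from the diagram above could be sketched as follows. This is not the author's code: `PreLNBlock` is a name made up for this example, `nn.MultiheadAttention` stands in for the hand-written attention module, and the 4x feed-forward expansion is an assumption.

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """One pre-LayerNorm decoder block: x + Attn(LN(x)), then x + FF(LN(x))."""

    def __init__(self, emb_dim=768, n_heads=12, drop_rate=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb_dim)
        # Stand-in for the hand-written attention module in the original model.
        self.attn = nn.MultiheadAttention(
            emb_dim, n_heads, dropout=drop_rate, batch_first=True
        )
        self.norm2 = nn.LayerNorm(emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),   # 4x expansion is an assumption
            nn.GELU(approximate="tanh"),
            nn.Linear(4 * emb_dim, emb_dim),
        )
        self.drop = nn.Dropout(drop_rate)

    def forward(self, x):
        seq_len = x.shape[1]
        # True above the diagonal: a position may NOT attend to that (future) token.
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        h = self.norm1(x)
        h, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + self.drop(h)                   # dropout + residual
        h = self.ff(self.norm2(x))
        return x + self.drop(h)                # dropout + residual

block = PreLNBlock(emb_dim=64, n_heads=4, drop_rate=0.0).eval()
x = torch.randn(2, 10, 64)
print(block(x).shape)  # torch.Size([2, 10, 64])
```

The full model would stack 12 such blocks between the embedding layers and the final LayerNorm.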

Key implementation details:

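Pre-LayerNorm: normalization is applied before each attention and feed-forward sub-layer, which tends to train more stably than post-LayerNorm
Causal masking via an upper-triangular matrix that blocks attention to future tokens
Weight tying between token embedding and output head (reduces params from ~162M to ~124M)
Custom GELU using the tanh approximation, matching GPT-2's implementation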
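
Three of the details above are small enough to show in a few lines each. The snippet is an illustrative sketch with toy sizes, not the repository's code; variable names and dimensions are chosen for the example only.

```python
import math
import torch
import torch.nn as nn

# 1) Causal mask: entries above the diagonal mark future positions.
seq_len = 4
scores = torch.zeros(seq_len, seq_len)          # placeholder attention scores
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
weights = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
# Row i now spreads its attention over positions 0..i only.

# 2) Tanh approximation of GELU, the form used in the original GPT-2 code.
class GELU(nn.Module):
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(
            math.sqrt(2.0 / math.pi) * (x + 0.044715 * x.pow(3))
        ))

# 3) Weight tying: the output head reuses the token-embedding matrix.
vocab_size, emb_dim = 100, 16                   # toy sizes for illustration
tok_emb = nn.Embedding(vocab_size, emb_dim)
out_head = nn.Linear(emb_dim, vocab_size, bias=False)
out_head.weight = tok_emb.weight                # both layers share one tensor

print(weights)
```

Because the embedding matrix and the output projection both have shape (vocab_size, emb_dim), the assignment in step 3 needs no transpose.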

Sample Output

After 10 epochs, the model generates coherent short stories:

Prompt: "Tom and his friend were playing"

Output:

Tom and his friend were playing with the ball. They were having fun until they heard a loud noise. It was a big dog. The dog was barking and running towards them. "Hey, that's our ball!" Tom said. "Give it back!"

Use in Transformers

First, ensure you have the transformers and torch libraries installed:

pip install transformers torch

⚠️ Important: Because this model uses a custom architecture wrapper (hf_model.py), you must pass trust_remote_code=True when loading the model.

Option 1: Text Generation Pipeline (Recommended)

You can easily use this model for text generation via the Hugging Face pipeline:

import torch
from transformers import pipeline

# Set up the text generation pipeline
generator = pipeline(
    "text-generation",
    model="snehangshu511/tinystories-gpt2-124M-scratch",
    trust_remote_code=True, # Required for custom architecture
    device="cuda" if torch.cuda.is_available() else "cpu"
)

# Generate a story!
prompt = "Tom and his friend were playing"
output = generator(
    prompt,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_k=50
)

print(output[0]['generated_text'])

Option 2: Manual Loading (AutoModelForCausalLM)

If you want more control over the tokenization and decoding process, you can load the model and tokenizer directly using transformers:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "snehangshu511/tinystories-gpt2-124M-scratch"

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Load Model
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True # Required for custom architecture
)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Prepare input
input_text = "A boy named Max had a big red ball"
inputs = tokenizer(input_text, return_tensors="pt").to(device)

# Generate
with torch.no_grad():
    output_ids = model.generate(
        inputs.input_ids,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.8,
        top_k=50,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode and print output
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)

Intended Use & Limitations

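Inference

To run the model without transformers, load the raw PyTorch checkpoint with the original GPTModel class and tiktoken: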

import torch
import tiktoken

# GPTModel and generate_text (used below) are not defined in this snippet;
# define or import them from the training code before running it.
cfg = {
    'vocab_size': 50257,
    'context_length': 128,
    'emb_dim': 768,
    'n_heads': 12,
    'n_layers': 12,
    'drop_rate': 0.1,
    'qkv_bias': False
}

model = GPTModel(cfg)
model.load_state_dict(
    torch.load("tinystories-gpt2-124M-scratch.pt", map_location="cpu")
)
model.eval()

tokenizer = tiktoken.get_encoding("gpt2")

# Generate text
def generate(prompt, max_new_tokens=100, temperature=0.8, top_k=40):
    encoded = torch.tensor(
        tokenizer.encode(prompt, allowed_special={"<|endoftext|>"})
    ).unsqueeze(0)

    token_ids = generate_text(
        model=model,
        idx=encoded,
        max_new_tokens=max_new_tokens,
        context_size=cfg["context_length"],
        temperature=temperature,
        top_k=top_k
    )
    return tokenizer.decode(token_ids.squeeze(0).tolist())

print(generate("Once upon a time, there was a little girl named Lily"))
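
The snippet above calls generate_text without showing it. A minimal sketch of what such a function might look like is given below; it follows the keyword arguments used in the call, but the body is an assumption (temperature scaling plus top-k sampling), and it presumes that model(idx) returns logits of shape (batch, seq_len, vocab_size).

```python
import torch

def generate_text(model, idx, max_new_tokens, context_size,
                  temperature=1.0, top_k=None):
    """Autoregressive sampling with optional top-k filtering and temperature.

    Hypothetical sketch: assumes model(idx) returns logits of shape
    (batch, seq_len, vocab_size).
    """
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]          # crop to the context window
        with torch.no_grad():
            logits = model(idx_cond)[:, -1, :]     # logits for the next token
        if top_k is not None:
            # Keep only the top_k largest logits; mask out the rest.
            kth_value = torch.topk(logits, top_k).values[:, -1:]
            logits = logits.masked_fill(logits < kth_value, float("-inf"))
        if temperature > 0.0:
            probs = torch.softmax(logits / temperature, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
        else:
            next_id = torch.argmax(logits, dim=-1, keepdim=True)  # greedy
        idx = torch.cat([idx, next_id], dim=1)
    return idx
```

Cropping to the last context_size tokens is what lets generation continue past 128 tokens, at the cost of the model forgetting everything earlier in the story.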

Good Prompts to Try

This model works best with TinyStories-style prompts:

"Once upon a time, there was a little girl named Lily"
"Tom and his friend were playing"
"One day, a small dog found a"
"Sara and her mom went to the park"
"The little rabbit was very hungry"
"In a small house near the forest"

Files in this Repository

File	Description
`pytorch_model.bin`	The trained model weights in Hugging Face format
`config.json`	The custom model configuration (includes auto-mapping)
`hf_model.py`	The custom model architecture required to load the weights
`tokenizer.json` & `tokenizer_config.json`	GPT-2 Tokenizer files used for text encoding/decoding
`README.md`	This model card

Limitations

Trained on simple children's stories only — not suitable for general Q&A or complex reasoning
Context window of 128 tokens — outputs may lose coherence beyond that
Will repeat phrases occasionally, especially at lower temperatures
Not instruction-tuned — it completes text, it does not answer questions

Author

Snehangshu Bhuin

GitHub: snehangshu2002
Reference: Build a Large Language Model (From Scratch) by Sebastian Raschka
