tinystories-gpt2-124M-scratch

A 124M-parameter GPT-2 style language model trained from scratch on the TinyStories dataset. The model code does not depend on the Hugging Face transformers library: every component (attention, layer norm, feed-forward, positional embeddings) is written in raw PyTorch.

Built as part of a hands-on deep dive into how large language models actually work, following Sebastian Raschka's Build a Large Language Model (From Scratch).


Model Details

| Property | Value |
|---|---|
| Architecture | Decoder-only Transformer (GPT-2 style) |
| Parameters | ~124M (with weight tying) |
| Layers | 12 Transformer blocks |
| Attention heads | 12 |
| Embedding dim | 768 |
| Context length | 128 tokens |
| Vocabulary | 50,257 (GPT-2 BPE via tiktoken) |
| Training data | 20,000 TinyStories samples |
| Epochs | 10 |
| Final train loss | ~1.86 |
| Final val loss | ~2.05 |
| Hardware | NVIDIA T4 (Google Colab) |
| Precision | FP16 mixed precision |
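
The parameter figure can be sanity-checked by hand. The sketch below is not taken from the repository: it assumes the layer shapes of the reference GPTModel in Raschka's book (bias-free Q/K/V projections, a biased attention output projection, a 4x feed-forward expansion, scale-and-shift LayerNorm, and an output head without bias). The helper name `count_params` is hypothetical, and `hf_model.py` may differ in details.

```python
def count_params(vocab_size=50257, context_length=128, emb_dim=768,
                 n_layers=12, qkv_bias=False, tie_weights=True):
    """Count parameters for a GPT-2 style model under the assumed layer shapes."""
    tok_emb = vocab_size * emb_dim
    pos_emb = context_length * emb_dim

    # Attention: Q, K, V projections (optional bias) plus a biased output projection.
    qkv = 3 * (emb_dim * emb_dim + (emb_dim if qkv_bias else 0))
    out_proj = emb_dim * emb_dim + emb_dim

    # Feed-forward: emb_dim -> 4 * emb_dim -> emb_dim, both layers with bias.
    hidden = 4 * emb_dim
    ffn = (emb_dim * hidden + hidden) + (hidden * emb_dim + emb_dim)

    # Two LayerNorms per block, each with a scale and a shift vector.
    norms = 2 * (2 * emb_dim)

    block = qkv + out_proj + ffn + norms
    final_norm = 2 * emb_dim
    # With weight tying, the output head reuses the token embedding matrix.
    out_head = 0 if tie_weights else emb_dim * vocab_size

    return tok_emb + pos_emb + n_layers * block + final_norm + out_head


tied = count_params(tie_weights=True)
untied = count_params(tie_weights=False)
print(f"tied:   {tied:,}")
print(f"untied: {untied:,}")
```

Under these assumptions the counts come to roughly 123.7M with weight tying and 162.3M without, which is in line with the ~124M figure in the table.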

Training

The model was trained using PyTorch Lightning with the following setup:

  • Optimizer: AdamW (lr=3e-4, weight_decay=0.1, betas=(0.9, 0.95))
  • Gradient clipping: 1.0
  • Batch size: 8
  • Stride: 128 (non-overlapping chunks)
  • Dataset: roneneldan/TinyStories (20k train / 5k val stories)
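
To illustrate what a stride of 128 with a context length of 128 means, here is a minimal sketch of a sliding-window split into non-overlapping training chunks. It assumes a dataset in the style of the book's sliding-window loader; `make_chunks` and the toy token list are hypothetical and not taken from the training code.

```python
def make_chunks(token_ids, context_length=128, stride=128):
    """Split a token sequence into (input, target) pairs for next-token prediction."""
    pairs = []
    # The target is the input shifted one position to the right,
    # so each window needs context_length + 1 tokens.
    for start in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[start:start + context_length]
        targets = token_ids[start + 1:start + context_length + 1]
        pairs.append((inputs, targets))
    return pairs


toy_ids = list(range(300))   # stand-in for a tokenized story
pairs = make_chunks(toy_ids)
print(len(pairs))            # number of training windows
print(pairs[1][0][0])        # first input token of the second window
```

Because the stride equals the context length, the input of each window starts exactly where the input of the previous one ended, so no input token is seen twice within an epoch.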

Loss dropped from ~21.5 at step 100 to ~1.86 by the end of epoch 10.
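
For intuition, the reported losses can be converted to perplexity. This is a small sketch that assumes the losses are mean per-token cross-entropy in nats (the PyTorch default); the variable names are illustrative.

```python
import math

train_loss, val_loss = 1.86, 2.05   # final losses reported in this card
vocab_size = 50257

train_ppl = math.exp(train_loss)
val_ppl = math.exp(val_loss)
# Loss of a model that assigns equal probability to every token.
uniform_loss = math.log(vocab_size)

print(f"train perplexity ~ {train_ppl:.2f}")
print(f"val perplexity   ~ {val_ppl:.2f}")
print(f"uniform baseline ~ {uniform_loss:.2f}")
```

An early loss above the uniform baseline means the model was, at that point, assigning less probability to the correct tokens than a uniform guess would.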


Architecture

The model is a clean decoder-only transformer built from scratch:

Token Embedding → Positional Embedding → Dropout
→ 12x [LayerNorm → Masked Multi-Head Attention → Dropout + Residual
       → LayerNorm → Feed-Forward (GELU) → Dropout + Residual]
→ Final LayerNorm → Linear Output Head
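
The "masked" part of the attention step can be illustrated without any tensor library. The following is a pure-Python sketch of causal masking applied to a matrix of attention scores; it is not the repository's implementation, and `causal_softmax` is a hypothetical name.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax over attention scores with future positions masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend only to positions 0..i; entries with j > i
        # (the strict upper triangle) are set to -inf before the softmax.
        row = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        row_max = max(row)
        exps = [math.exp(x - row_max) for x in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [[0.0] * 4 for _ in range(4)]   # equal scores for a 4-token sequence
attn = causal_softmax(scores)
for row in attn:
    print([round(w, 3) for w in row])
```

Each row sums to 1 and has zeros to the right of the diagonal, so a token never receives information from tokens that come after it.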

Key implementation details:

  • Pre-LayerNorm (normalization before each sub-layer; typically more stable to train than post-LayerNorm)
  • Causal masking via upper-triangular matrix
  • Weight tying between token embedding and output head (reduces the parameter count from ~162M to ~124M)
  • Custom GELU approximation matching GPT-2's implementation
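
As a reference for the last point, the tanh-based GELU approximation used in the original GPT-2 code can be written in a few lines. This sketch uses only the standard library and compares it with the exact, erf-based definition; the function names are illustrative, and the repository's own version presumably operates on PyTorch tensors instead of floats.

```python
import math


def gelu_tanh(x):
    """Tanh approximation of GELU, the form used in the original GPT-2 code."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_exact(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  tanh={gelu_tanh(x):+.5f}  exact={gelu_exact(x):+.5f}")
```

The two versions agree closely over the range that matters in practice, which is why the approximation can be swapped for the exact form with little effect on outputs.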

Sample Output

After 10 epochs, the model generates coherent short stories:

Prompt: "Tom and his friend were playing"

Output:

Tom and his friend were playing with the ball. They were having fun until they heard a loud noise. It was a big dog. The dog was barking and running towards them. "Hey, that's our ball!" Tom said. "Give it back!"

Use in Transformers

First, ensure you have the transformers and torch libraries installed:

pip install transformers torch

⚠️ Important: Because this model uses a custom architecture wrapper (hf_model.py), you must pass trust_remote_code=True when loading the model.

Option 1: Text Generation Pipeline (Recommended)

You can easily use this model for text generation via the Hugging Face pipeline:

import torch
from transformers import pipeline

# Set up the text generation pipeline
generator = pipeline(
    "text-generation",
    model="snehangshu511/tinystories-gpt2-124M-scratch",
    trust_remote_code=True, # Required for custom architecture
    device="cuda" if torch.cuda.is_available() else "cpu"
)

# Generate a story!
prompt = "Tom and his friend were playing"
output = generator(
    prompt,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_k=50
)

print(output[0]['generated_text'])

Option 2: Manual Loading (AutoModelForCausalLM)

If you want more control over the tokenization and decoding process, you can load the model and tokenizer directly using transformers:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "snehangshu511/tinystories-gpt2-124M-scratch"

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Load Model
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True # Required for custom architecture
)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Prepare input
input_text = "A boy named Max had a big red ball"
inputs = tokenizer(input_text, return_tensors="pt").to(device)

# Generate
with torch.no_grad():
    output_ids = model.generate(
        inputs.input_ids,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.8,
        top_k=50,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode and print output
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)


Inference

import torch
import tiktoken

# GPTModel and generate_text are not defined in this snippet;
# define or import them from the training code before running it.
cfg = {
    'vocab_size': 50257,
    'context_length': 128,
    'emb_dim': 768,
    'n_heads': 12,
    'n_layers': 12,
    'drop_rate': 0.1,
    'qkv_bias': False
}

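model = GPTModel(cfg)
# Assumes the raw state dict has been saved locally under this filename.
model.load_state_dict(
    torch.load("tinystories-gpt2-124M-scratch.pt", map_location="cpu")
)
model.eval()  # disable dropout for inference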

tokenizer = tiktoken.get_encoding("gpt2")

# Generate text
def generate(prompt, max_new_tokens=100, temperature=0.8, top_k=40):
    encoded = torch.tensor(
        tokenizer.encode(prompt, allowed_special={"<|endoftext|>"})
    ).unsqueeze(0)

    # Gradients are not needed for sampling.
    with torch.no_grad():
        token_ids = generate_text(
            model=model,
            idx=encoded,
            max_new_tokens=max_new_tokens,
            context_size=cfg["context_length"],
            temperature=temperature,
            top_k=top_k
        )
    return tokenizer.decode(token_ids.squeeze(0).tolist())

print(generate("Once upon a time, there was a little girl named Lily"))
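
The `temperature` and `top_k` arguments control how the next token is chosen at each step. Since `generate_text` is not shown here, the following is only a standard-library sketch of what top-k filtering combined with temperature scaling usually does for a single step. `sample_top_k` and the toy logits are hypothetical and not the author's implementation.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Sample one token id from logits using top-k filtering and temperature."""
    # Keep the indices of the top_k largest logits.
    k = min(top_k, len(logits))
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [logits[i] / temperature for i in kept]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    return rng.choices(kept, weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]   # hypothetical 5-token vocabulary
rng = random.Random(0)
samples = [sample_top_k(toy_logits, top_k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))   # only the two highest-scoring ids can appear
```

Lower temperatures concentrate probability on the highest-scoring tokens, which makes output more predictable but also more prone to repetition.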

Good Prompts to Try

This model works best with TinyStories-style prompts:

  • "Once upon a time, there was a little girl named Lily"
  • "Tom and his friend were playing"
  • "One day, a small dog found a"
  • "Sara and her mom went to the park"
  • "The little rabbit was very hungry"
  • "In a small house near the forest"

Files in this Repository

| File | Description |
|---|---|
| pytorch_model.bin | The trained model weights in Hugging Face format |
| config.json | The custom model configuration (includes auto-mapping) |
| hf_model.py | The custom model architecture required to load the weights |
| tokenizer.json & tokenizer_config.json | GPT-2 tokenizer files used for text encoding/decoding |
| README.md | This model card |

Limitations

  • Trained on simple children's stories only, so it is not suitable for general Q&A or complex reasoning
  • Context window of 128 tokens: longer generations may lose coherence because earlier text falls outside the window
  • May repeat phrases, especially at lower temperatures
  • Not instruction-tuned: it completes text and does not answer questions
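
The context-window limit can be made concrete with a short sketch. It assumes the generation loop feeds the model only the most recent `context_size` tokens, as the `context_size` argument in the inference code suggests; `crop_context` is a hypothetical helper and not part of the repository.

```python
def crop_context(token_ids, context_size=128):
    """Keep only the most recent context_size tokens as model input."""
    return token_ids[-context_size:]


history = list(range(200))      # stand-in for prompt + generated token ids
window = crop_context(history)
print(len(window), window[0])   # tokens before index 72 are no longer visible
```

Once a story grows past 128 tokens, the opening is no longer part of the input, which is one reason long generations can drift.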

Author

Snehangshu Bhuin

  • GitHub: snehangshu2002
  • Reference: Build a Large Language Model (From Scratch) by Sebastian Raschka