SLM β€” Small Language Model trained on TinyStories

A GPT-style small language model (~16M parameters) built entirely from scratch using PyTorch and trained on the TinyStories dataset.

This model was built as a learning project to understand transformer architecture, training loops, gradient accumulation, and mixed-precision training from the ground up.


Model details

Architecture: GPT-style decoder-only transformer
Parameters: ~16M
Layers: 4 transformer blocks
Attention heads: 4
Embedding dimension: 256
Context window: 64 tokens
Vocabulary: GPT-2 BPE (50,257 tokens)
Training data: TinyStories (50k-story subset)
Training iterations: 3,000
Optimizer: AdamW (beta1=0.9, beta2=0.95, weight decay=0.1)
LR schedule: linear warmup -> cosine decay
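The figures above can be sanity-checked with a back-of-the-envelope parameter count. The sketch below uses a hypothetical config (field names are illustrative, not taken from modeling_slm) and ignores biases and LayerNorm weights:

```python
from dataclasses import dataclass

# Hypothetical config mirroring the table above.
@dataclass
class SLMConfig:
    vocab_size: int = 50257  # GPT-2 BPE
    block_size: int = 64     # context window
    n_layer: int = 4
    n_head: int = 4
    n_embd: int = 256

def approx_params(cfg: SLMConfig) -> int:
    """Rough parameter count, ignoring biases and LayerNorms."""
    tok_emb = cfg.vocab_size * cfg.n_embd   # tied with the output head
    pos_emb = cfg.block_size * cfg.n_embd
    attn = 4 * cfg.n_embd ** 2              # q, k, v, out projections
    mlp = 8 * cfg.n_embd ** 2               # two 4x-expansion linears
    return tok_emb + pos_emb + cfg.n_layer * (attn + mlp)

print(approx_params(SLMConfig()))  # ~16.0M, matching the table
```

Note that the tied token embedding alone accounts for ~12.9M of the ~16M parameters; the four transformer blocks contribute only ~3.1M.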

How to use

Install dependencies

pip install transformers torch huggingface_hub

Generate text

import torch
from transformers import GPT2Tokenizer
from modeling_slm import SLMModel

# Load tokenizer and model (trust_remote_code is required because the
# architecture is custom)
tokenizer = GPT2Tokenizer.from_pretrained("YOUR_HF_USERNAME/slm-tinystories")
model = SLMModel.from_pretrained("YOUR_HF_USERNAME/slm-tinystories",
                                 trust_remote_code=True)
model.eval()

# Generate (no_grad avoids building an unneeded autograd graph)
prompt = "Once upon a time there was a little girl"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(inputs["input_ids"], max_new_tokens=200,
                            temperature=1.0, top_k=50)
print(tokenizer.decode(output.squeeze().tolist(), skip_special_tokens=True))

Training approach

The model was trained using:

  • Gradient accumulation (8 steps) to simulate an effective batch size of 128 sequences without needing that much GPU memory at once
  • Mixed precision (bfloat16) for faster training and lower memory usage
  • Gradient clipping (max norm 0.5) for training stability
  • Weight tying between the token embedding matrix and the output projection layer — removes a duplicate vocab-by-embedding matrix (~12.9M parameters) and typically helps small models learn faster
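The pieces above fit together in the inner training loop. The sketch below is illustrative, not the actual training script: the micro-batch size, warmup length, and peak learning rate are assumptions (only accumulation steps, clipping norm, betas, weight decay, and total iterations come from the card), and a toy linear model stands in for the transformer so the loop runs as-is:

```python
import math
import torch
import torch.nn as nn

accum_steps  = 8       # from the card: 8 accumulation steps
micro_batch  = 16      # assumed; 16 x 8 = effective batch of 128
max_iters    = 3000
warmup_iters = 100     # assumed; not stated on the card
max_lr       = 3e-4    # assumed
clip_norm    = 0.5

def lr_at(it: int) -> float:
    """Linear warmup, then cosine decay to ~0."""
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    return 0.5 * max_lr * (1 + math.cos(math.pi * progress))

# Toy stand-in model so the loop below is runnable as-is.
model = nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr,
                              betas=(0.9, 0.95), weight_decay=0.1)

for it in range(2):  # a couple of iterations for illustration
    for g in optimizer.param_groups:
        g["lr"] = lr_at(it)
    for _ in range(accum_steps):
        x = torch.randn(micro_batch, 8)
        # bfloat16 autocast (use "cuda" on GPU)
        with torch.autocast("cpu", dtype=torch.bfloat16):
            loss = model(x).pow(2).mean()
        # Divide so accumulated gradients average over the full batch
        (loss / accum_steps).backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```

Note the division by accum_steps before backward(): without it, accumulated gradients would be 8x too large and the effective learning rate would silently change.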

Limitations

  • Context window of 64 tokens is short β€” long prompts will be truncated
  • Trained on a small 50k-story subset, so vocabulary and topics are limited
  • Output quality is intentionally modest β€” this is a from-scratch learning project, not a production model
  • Not suitable for any real-world application

What I learned building this

  • How transformer blocks work internally (attention, MLP, residual connections, layer norm)
  • How gradient accumulation simulates large batch training on limited hardware
  • How the training loop, optimizer, and LR scheduler fit together
  • How to package a custom PyTorch model for HuggingFace Hub
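The attention/MLP/residual/layer-norm pattern from the first bullet can be sketched as a minimal pre-norm GPT block. This is the general shape, not the actual modeling_slm implementation (which likely hand-rolls attention rather than using nn.MultiheadAttention):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Minimal pre-norm decoder block: LayerNorm -> causal attention
    -> residual, then LayerNorm -> MLP -> residual."""
    def __init__(self, d: int = 256, heads: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(
            nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # Boolean causal mask: True above the diagonal = blocked
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), 1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + a                       # residual around attention
        x = x + self.mlp(self.ln2(x))   # residual around MLP
        return x

y = Block()(torch.randn(2, 8, 256))  # (batch, seq, d) in, same shape out
```

The residual connections keep the input and output shapes identical, which is what lets blocks stack cleanly.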
