- KodaLite-1.3B (Koda-v0.1)
- Benchmark results (zero-shot, 8 standard tasks)
- Why KodaLite scores below GPT-2-124M (despite being 10× bigger)
- Chat Format
- Usage (Transformers)
- Usage (MLX — Apple Silicon)
- Usage (llama.cpp / Ollama / LM Studio)
- Architecture (LLaMA-compatible)
- Training
- Limitations
- Lessons learned (for a potential v0.2)
- License
KodaLite-1.3B (Koda-v0.1)
A 1.27B parameter LLaMA-style decoder-only language model, trained entirely from scratch on 2x NVIDIA L40S GPUs using JAX + Flax NNX, then converted to HuggingFace Transformers format.
TL;DR: KodaLite reaches ~37% average accuracy on standard LLM benchmarks. It is severely undertrained (only 1.64B tokens vs 40B–3T for comparable models), which places it just below GPT-2-124M despite having roughly 10× more parameters. It is a clear illustration of the Chinchilla scaling law: at this budget, training tokens matter more than parameter count.
Benchmark results (zero-shot, 8 standard tasks)
Evaluated against 8 reference models (0.12B to 1.56B parameters) on the same benchmarks (HellaSwag, ARC-E/C, WinoGrande, PIQA, BoolQ, OpenBookQA, LAMBADA-OpenAI).
| Rank | Model | Params | Train tokens | Avg accuracy |
|---|---|---|---|---|
| 1 | TinyLlama-1.1B | 1.10B | 3000B | 50.3% |
| 2 | Pythia-1.4B | 1.41B | 300B | 50.2% |
| 3 | GPT-2-XL | 1.56B | 40B | 49.4% |
| 4 | OPT-1.3B | 1.32B | 180B | 49.1% |
| 5 | Pythia-1B | 1.01B | 300B | 47.6% |
| 6 | GPT-2-large | 0.77B | 40B | 46.2% |
| 7 | GPT-2-medium | 0.35B | 40B | 44.2% |
| 8 | GPT-2-124m | 0.12B | 40B | 39.7% |
| 9 | KodaLite-1.3B | 1.27B | 1.64B | 36.8% |
Per-task breakdown
| Task | KodaLite-1.3B | GPT-2-124M | GPT-2-XL | Pythia-1.4B | TinyLlama-1.1B | Random |
|---|---|---|---|---|---|---|
| HellaSwag | 25.65 | 29.22 | 47.94 | 49.21 | 56.2 | 25.0 |
| ARC-Easy | 32.79 | 38.30 | 50.80 | 51.73 | 43.9 | 25.0 |
| ARC-Challenge | 21.50 | 22.70 | 28.16 | 29.01 | 30.0 | 25.0 |
| WinoGrande | 49.57 | 49.49 | 51.93 | 52.88 | 52.2 | 50.0 |
| PIQA | 58.92 | 62.24 | 70.89 | 71.22 | 72.1 | 50.0 |
| BoolQ | 44.34 | 49.76 | 61.59 | 63.70 | 60.6 | 50.0 |
| OpenBookQA | 25.00 | 26.40 | 34.20 | 33.40 | 37.2 | 25.0 |
| LAMBADA (acc / ppl) | 18.22 / 93.8 | 30.84 / 17.5 | 50.79 / 6.4 | 61.03 / 3.8 | — | — |
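The Avg accuracy column in the ranking table is not defined explicitly. The per-task numbers above are consistent with an unweighted mean over the seven multiple-choice tasks, with LAMBADA left out; this is an inference from the figures, not something the source states. A minimal Python check under that assumption (the dictionary and function names are illustrative):

```python
# Per-task accuracies copied from the table above (LAMBADA deliberately excluded).
# Order: HellaSwag, ARC-Easy, ARC-Challenge, WinoGrande, PIQA, BoolQ, OpenBookQA.
PER_TASK = {
    "KodaLite-1.3B": [25.65, 32.79, 21.50, 49.57, 58.92, 44.34, 25.00],
    "GPT-2-124M": [29.22, 38.30, 22.70, 49.49, 62.24, 49.76, 26.40],
    "GPT-2-XL": [47.94, 50.80, 28.16, 51.93, 70.89, 61.59, 34.20],
    "Pythia-1.4B": [49.21, 51.73, 29.01, 52.88, 71.22, 63.70, 33.40],
    "TinyLlama-1.1B": [56.2, 43.9, 30.0, 52.2, 72.1, 60.6, 37.2],
}


def mean_accuracy(scores):
    """Unweighted mean of a list of accuracies, in percent."""
    return sum(scores) / len(scores)


# Assumption: the leaderboard average is the 7-task mean rounded to one decimal.
averages = {name: round(mean_accuracy(s), 1) for name, s in PER_TASK.items()}

for name, avg in averages.items():
    print(f"{name:15s} {avg:.1f}%")
```

Under this assumption the script reproduces the 36.8, 39.7, 49.4, 50.2 and 50.3 values listed in the ranking table.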
Why KodaLite scores below GPT-2-124M (despite being 10× bigger)
The Chinchilla scaling law (DeepMind, 2022) suggests that a model with N parameters needs roughly 20×N training tokens to be compute-optimal:
| Model | Params | Chinchilla target (~20× params) | Actual tokens | Ratio |
|---|---|---|---|---|
| KodaLite-1.3B | 1.27B | ~25B | 1.64B | 6.5 % 🔴 |
| GPT-2-XL | 1.5B | ~30B | 40B | 133 % |
| Pythia-1.4B | 1.4B | ~28B | 300B | 1070 % |
| TinyLlama-1.1B | 1.1B | ~22B | 3000B | 13600 % |
KodaLite has seen only 6.5% of the tokens it would need to reach that target. A larger but undertrained model scores lower than a smaller, well-trained one. The LAMBADA perplexity (93.8 vs 17.5 for GPT-2-124M) is the clearest signal: the base language model has not converged.
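PIQA (physical commonsense) is the task where KodaLite sits furthest above chance (58.92 vs 50.0), and the gap to GPT-2-124M (62.24) is only about 3 points; that kind of knowledge appears to be learned faster than factual knowledge or precise language modeling.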
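The ratios in the Chinchilla table can be reproduced with a few lines of Python. This is a sketch using the 20 tokens-per-parameter rule of thumb quoted above; the function names are illustrative, and the parameter counts are the rounded values from that table.

```python
TOKENS_PER_PARAM = 20  # rule of thumb quoted above; an approximation, not an exact law


def chinchilla_target(params_b):
    """Approximate token target (in billions) for a model with params_b billion parameters."""
    return TOKENS_PER_PARAM * params_b


def coverage_pct(params_b, tokens_b):
    """Actual training tokens as a percentage of the 20x-parameters target."""
    return 100.0 * tokens_b / chinchilla_target(params_b)


# (parameters in billions, training tokens in billions), as listed in the table.
MODELS = {
    "KodaLite-1.3B": (1.27, 1.64),
    "GPT-2-XL": (1.5, 40),
    "Pythia-1.4B": (1.4, 300),
    "TinyLlama-1.1B": (1.1, 3000),
}

for name, (params_b, tokens_b) in MODELS.items():
    target = chinchilla_target(params_b)
    print(f"{name:15s} target ~{target:.0f}B tokens, coverage {coverage_pct(params_b, tokens_b):.1f}%")
```

The table rounds the two largest ratios to three significant figures, so the printed values for Pythia and TinyLlama differ slightly in the last digits.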
Chat Format
The model uses three text markers (<|user|>, <|assistant|>, <|end|>); each turn ends with <|end|> followed by <|endoftext|> (token id 50256, the GPT-2 BPE EOS):
<|user|>
Your question
<|assistant|>
Model response
<|end|><|endoftext|>
A short LoRA pass (May 2026) taught the model to emit <|endoftext|> (50256) right after <|end|>, so generation now stops natively on EOS in Transformers, MLX, llama.cpp, Ollama, and LM Studio, without any stop_strings workaround.
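For runtimes without chat-template support, the prompt can be assembled by hand. The sketch below assumes that each marker sits on its own line, exactly as in the layout above; the whitespace is an assumption, so the chat template shipped with the tokenizer remains the reference. `build_prompt` is a hypothetical helper, not part of any library.

```python
USER, ASSISTANT = "<|user|>", "<|assistant|>"
END, EOS = "<|end|>", "<|endoftext|>"


def build_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the KodaLite chat layout.

    Assumption: markers and content are separated by single newlines.
    """
    parts = []
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "user":
            parts.append(f"{USER}\n{content}\n")
        elif role == "assistant":
            # A completed assistant turn is closed with <|end|><|endoftext|>.
            parts.append(f"{ASSISTANT}\n{content}\n{END}{EOS}\n")
        else:
            raise ValueError(f"unsupported role: {role!r}")
    if add_generation_prompt:
        # Leave the assistant marker open so the model writes the reply.
        parts.append(f"{ASSISTANT}\n")
    return "".join(parts)


prompt = build_prompt([{"role": "user", "content": "What is the capital of France?"}])
print(prompt)
```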
Usage (Transformers)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("YoAbriel/KodaLite-1.3B")
model = AutoModelForCausalLM.from_pretrained(
    "YoAbriel/KodaLite-1.3B", dtype=torch.bfloat16, device_map="auto"
)

msg = [{"role": "user", "content": "What is the capital of France?"}]
# add_generation_prompt=True asks the template to open the assistant turn.
prompt = tok.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs, max_new_tokens=150, do_sample=True, temperature=0.7, top_k=40,
)
# Decode only the newly generated tokens, not the prompt.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
Usage (MLX — Apple Silicon)
See YoAbriel/KodaLite-1.3B-mlx.
from mlx_lm import load, generate

model, tok = load("YoAbriel/KodaLite-1.3B-mlx-8bit")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    tokenize=False,
    add_generation_prompt=True,  # open the assistant turn before generating
)
print(generate(model, tok, prompt=prompt, max_tokens=150))
Usage (llama.cpp / Ollama / LM Studio)
See YoAbriel/KodaLite-1.3B-GGUF.
ollama run hf.co/YoAbriel/KodaLite-1.3B-GGUF:Q4_K_M
In LM Studio, just load the GGUF file. The model now emits <|endoftext|> (token 50256) at the end of every turn, so it stops natively without any Stop String configuration. Output may include the <|end|> text marker, which is harmless and easy to strip.
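Since the <|end|> marker may show up as plain text in the output, a small post-processing step can remove it. A minimal sketch; `strip_end_marker` is an illustrative name, not part of any of the runtimes above.

```python
def strip_end_marker(text, marker="<|end|>"):
    """Return the text before the first end-of-turn marker, with surrounding whitespace trimmed."""
    return text.split(marker, 1)[0].strip()


raw_output = "The capital of France is Paris.\n<|end|>"
print(strip_end_marker(raw_output))
```

Text without the marker passes through unchanged apart from the whitespace trim, so the helper is safe to apply to every response.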
Architecture (LLaMA-compatible)
| Component | Value |
|---|---|
| Parameters | 1.27B |
| Layers | 24 |
| Hidden size | 2048 |
| Attention | GQA (32Q / 8KV heads) |
| Head dim | 64 |
| FFN | SwiGLU, intermediate 5504 |
| Normalization | RMSNorm (pre-norm) |
| Position | RoPE (theta=10000) |
| Context | 1024 tokens |
| Vocab | 50,257 (GPT-2 BPE) |
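The 1.27B figure can be cross-checked from the table. The sketch below assumes bias-free linear layers, two RMSNorm weight vectors per layer plus a final one, and an output head that is not tied to the input embedding; none of these details are stated above, but under them the count lands at about 1.27B.

```python
# Values from the architecture table.
VOCAB = 50_257
HIDDEN = 2048
LAYERS = 24
Q_HEADS = 32
KV_HEADS = 8
HEAD_DIM = 64
FFN_INTERMEDIATE = 5504


def count_parameters(tie_embeddings=False):
    """Estimate the parameter count of a LLaMA-style decoder (assumes no biases)."""
    embed = VOCAB * HIDDEN
    # Attention: Q and O are full-width; K and V are narrower because of GQA.
    q_proj = HIDDEN * Q_HEADS * HEAD_DIM
    kv_proj = 2 * HIDDEN * KV_HEADS * HEAD_DIM
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN
    # SwiGLU uses three matrices: gate, up and down.
    ffn = 3 * HIDDEN * FFN_INTERMEDIATE
    norms = 2 * HIDDEN  # pre-attention and pre-FFN RMSNorm weights
    per_layer = q_proj + kv_proj + o_proj + ffn + norms
    lm_head = 0 if tie_embeddings else VOCAB * HIDDEN
    final_norm = HIDDEN
    return embed + LAYERS * per_layer + final_norm + lm_head


total = count_parameters()
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

With tied embeddings the same formula gives roughly 1.17B, so an untied output head is the better match for the reported size.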
Training
Pre-training
- Dataset: SlimPajama-6B (streaming)
- Tokens seen: 1.64B
- Hardware: 2x NVIDIA L40S (96GB VRAM total)
- Precision: bfloat16
- Framework: JAX + Flax NNX (trained from scratch, no base model)
SFT
- Datasets: Databricks Dolly-15K + OpenAssistant OASST1
- Method: LoRA (rank=16, alpha=32), then merged into base weights
- End-of-turn: <|end|> (5 BPE tokens) followed by <|endoftext|> (token 50256, the GPT-2 EOS)
EOS fine-tune (May 2026)
- Goal: teach the model to emit <|endoftext|> (50256) right after <|end|> so any framework with single-token EOS support (GGUF, MLX, Transformers) can stop natively.
- Method: 200 extra LoRA steps on the existing SFT corpus with <|endoftext|> appended after each <|end|> boundary.
- Result: 5/5 MLX 8bit and llama.cpp tests stop on EOS without stop_strings workarounds.
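The data-preparation step behind this fine-tune amounts to appending <|endoftext|> after every <|end|> in the SFT text. A minimal sketch of that transformation follows; it is not the author's script, and `append_eos` is a hypothetical name. The negative lookahead keeps the function idempotent, so text that already carries the EOS marker is left alone.

```python
import re

END = "<|end|>"
EOS = "<|endoftext|>"

# Match <|end|> only when it is not already followed by <|endoftext|>.
_END_WITHOUT_EOS = re.compile(re.escape(END) + "(?!" + re.escape(EOS) + ")")


def append_eos(text):
    """Insert <|endoftext|> after each <|end|> boundary that lacks it."""
    return _END_WITHOUT_EOS.sub(END + EOS, text)


sample = "<|user|>\nHi\n<|assistant|>\nHello\n<|end|>"
print(append_eos(sample))
```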
Limitations
- Severely undertrained (6.5% of Chinchilla-optimal) — factual accuracy is low
- May produce repetitive or inaccurate responses
- English only
- 1024 context window
- Educational / research project — not production-ready
Lessons learned (for a potential v0.2)
- Train longer: aim for 20B+ tokens (Chinchilla-optimal for 1.3B would be ~25B).
- Pick a single-token end-of-turn marker from the start. We initially trained with <|end|> (5 BPE tokens), which broke frameworks that expect a single-token EOS. Patching it after the fact via a 200-step LoRA worked, but designing with a single token like <|endoftext|> would have been cleaner.
- The SwiGLU + RMSNorm + GQA + RoPE architecture is sound: no issues there, and our results follow the expected scaling curve.
License
Apache 2.0