KodaLite-1.3B (Koda-v0.1)

A 1.27B parameter LLaMA-style decoder-only language model, trained entirely from scratch on 2x NVIDIA L40S GPUs using JAX + Flax NNX, then converted to HuggingFace Transformers format.

TL;DR: KodaLite reaches ~37% average accuracy on standard LLM benchmarks. It is severely undertrained (only 1.64B tokens vs 40B–3T for comparable models), which places it just below GPT-2-124M despite having 10× more parameters. This illustrates the Chinchilla scaling result: at this budget, tokens matter more than parameters.

Benchmark results (zero-shot, 8 standard tasks)

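Evaluated against 8 reference models ranging from 0.12B to 1.56B parameters on the same benchmarks (HellaSwag, ARC-E/C, WinoGrande, PIQA, BoolQ, OpenBookQA, LAMBADA-OpenAI). The average accuracy column matches the mean of the seven multiple-choice tasks in the per-task breakdown; LAMBADA appears to be reported separately.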

Rank	Model	Params	Train tokens	Avg accuracy
1	TinyLlama-1.1B	1.10B	3000B	50.3%
2	Pythia-1.4B	1.41B	300B	50.2%
3	GPT-2-XL	1.56B	40B	49.4%
4	OPT-1.3B	1.32B	180B	49.1%
5	Pythia-1B	1.01B	300B	47.6%
6	GPT-2-large	0.77B	40B	46.2%
7	GPT-2-medium	0.35B	40B	44.2%
8	GPT-2-124M	0.12B	40B	39.7%
9	KodaLite-1.3B	1.27B	1.64B	36.8%

Per-task breakdown

Task	KodaLite-1.3B	GPT-2-124M	GPT-2-XL	Pythia-1.4B	TinyLlama-1.1B	Random
HellaSwag	25.65	29.22	47.94	49.21	56.2	25.0
ARC-Easy	32.79	38.30	50.80	51.73	43.9	25.0
ARC-Challenge	21.50	22.70	28.16	29.01	30.0	25.0
WinoGrande	49.57	49.49	51.93	52.88	52.2	50.0
PIQA	58.92	62.24	70.89	71.22	72.1	50.0
BoolQ	44.34	49.76	61.59	63.70	60.6	50.0
OpenBookQA	25.00	26.40	34.20	33.40	37.2	25.0
LAMBADA (acc / ppl)	18.22 / 93.8	30.84 / 17.5	50.79 / 6.4	61.03 / 3.8	—	—
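
The averages in the ranking table can be reproduced from the per-task numbers. The sketch below assumes the average is an unweighted mean over the seven multiple-choice tasks, with LAMBADA left out (TinyLlama has no LAMBADA entry). Under that assumption the computed means round to the reported values for the five models in the breakdown. The names `SCORES` and `mean_accuracy` are illustrative only.

```python
# Per-task accuracies (%) copied from the per-task table, in the order:
# HellaSwag, ARC-Easy, ARC-Challenge, WinoGrande, PIQA, BoolQ, OpenBookQA.
# Assumption: LAMBADA is not part of the reported average, so it is omitted.
SCORES = {
    "KodaLite-1.3B": [25.65, 32.79, 21.50, 49.57, 58.92, 44.34, 25.00],
    "GPT-2-124M": [29.22, 38.30, 22.70, 49.49, 62.24, 49.76, 26.40],
    "GPT-2-XL": [47.94, 50.80, 28.16, 51.93, 70.89, 61.59, 34.20],
    "Pythia-1.4B": [49.21, 51.73, 29.01, 52.88, 71.22, 63.70, 33.40],
    "TinyLlama-1.1B": [56.2, 43.9, 30.0, 52.2, 72.1, 60.6, 37.2],
}


def mean_accuracy(values):
    """Unweighted mean of a list of per-task accuracies."""
    return sum(values) / len(values)


# Round to one decimal, as in the ranking table.
AVERAGES = {name: round(mean_accuracy(v), 1) for name, v in SCORES.items()}

for name, avg in sorted(AVERAGES.items(), key=lambda kv: -kv[1]):
    print(f"{name:16s} {avg:.1f}%")
```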

Why KodaLite scores below GPT-2-124M (despite being 10× bigger)

The Chinchilla scaling law (DeepMind, 2022) suggests that compute-optimal training of a model with N parameters uses approximately 20×N tokens:

Model	Params	Chinchilla target (~20× params)	Actual tokens	Ratio
KodaLite-1.3B	1.27B	~25B	1.64B	6.5 % 🔴
GPT-2-XL	1.5B	~30B	40B	133 %
Pythia-1.4B	1.4B	~28B	300B	1070 %
TinyLlama-1.1B	1.1B	~22B	3000B	13600 %

KodaLite has seen only 6.5% of the Chinchilla-optimal token count. A bigger but undertrained model scores lower than a smaller but well-trained one. The LAMBADA perplexity (94 vs 17 for GPT-2-124M) is the clearest signal: the base language model has not converged.
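
The ratio column in the table above is plain arithmetic, sketched here using the 20-tokens-per-parameter rule of thumb. Parameter and token counts are the values from the table; the function name `chinchilla_ratio` is illustrative.

```python
TOKENS_PER_PARAM = 20  # rule of thumb quoted in the text


def chinchilla_ratio(params_b, tokens_b):
    """Return (target tokens in billions, actual / target as a percentage)."""
    target_b = TOKENS_PER_PARAM * params_b
    return target_b, 100.0 * tokens_b / target_b


# (parameters in billions, training tokens in billions), from the table.
MODELS = {
    "KodaLite-1.3B": (1.27, 1.64),
    "GPT-2-XL": (1.5, 40),
    "Pythia-1.4B": (1.4, 300),
    "TinyLlama-1.1B": (1.1, 3000),
}

RATIOS = {name: chinchilla_ratio(p, t) for name, (p, t) in MODELS.items()}

for name, (target_b, pct) in RATIOS.items():
    print(f"{name:16s} target ~{target_b:.0f}B tokens, trained on {pct:.1f}% of it")
```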

PIQA (physical commonsense) is the task where KodaLite clears the random baseline by the widest margin (58.92 vs 50.0), which suggests that this kind of knowledge is learned faster than factual knowledge or precise language modeling.

Chat Format

<|user|>
Your question
<|assistant|>
Model response
<|end|><|endoftext|>

A short LoRA pass (May 2026) taught the model to emit <|endoftext|> (50256) right after <|end|>, so generation now stops natively on EOS in Transformers, MLX, llama.cpp, Ollama, and LM Studio, without any stop_strings workaround.
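
For cases where the tokenizer's chat template is not available (for example a raw completion endpoint), the format above can be assembled by hand. This is a minimal sketch: the exact whitespace is an assumption read off the layout shown above, and the helper names are hypothetical. `apply_chat_template` remains the safer route because it uses the template shipped with the model.

```python
USER_TAG = "<|user|>"
ASSISTANT_TAG = "<|assistant|>"
END_TAG = "<|end|>"
EOS_TAG = "<|endoftext|>"


def build_prompt(question: str) -> str:
    """Format one user turn and open the assistant turn for generation."""
    return f"{USER_TAG}\n{question}\n{ASSISTANT_TAG}\n"


def format_training_example(question: str, answer: str) -> str:
    """Format a complete turn, closed by the end marker and the EOS token."""
    return f"{build_prompt(question)}{answer}\n{END_TAG}{EOS_TAG}"


print(build_prompt("What is the capital of France?"))
```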

Usage (Transformers)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("YoAbriel/KodaLite-1.3B")
model = AutoModelForCausalLM.from_pretrained(
    "YoAbriel/KodaLite-1.3B", dtype=torch.bfloat16, device_map="auto"
)

msg = [{"role": "user", "content": "What is the capital of France?"}]
# add_generation_prompt=True asks the chat template to open the assistant turn
prompt = tok.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs, max_new_tokens=150, do_sample=True, temperature=0.7, top_k=40,
)
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Usage (MLX — Apple Silicon)

See YoAbriel/KodaLite-1.3B-mlx.

from mlx_lm import load, generate

model, tok = load("YoAbriel/KodaLite-1.3B-mlx-8bit")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    tokenize=False,
    add_generation_prompt=True,  # open the assistant turn before generating
)
print(generate(model, tok, prompt=prompt, max_tokens=150))

Usage (llama.cpp / Ollama / LM Studio)

See YoAbriel/KodaLite-1.3B-GGUF.

ollama run hf.co/YoAbriel/KodaLite-1.3B-GGUF:Q4_K_M

In LM Studio, just load the GGUF file. The model now emits <|endoftext|> (token 50256) at the end of every turn, so it stops natively without any Stop String configuration. Output may include the <|end|> text marker, which is harmless and easy to strip.
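
Stripping the leftover marker takes a few lines of post-processing. The helper below is a hypothetical sketch, not part of any of the tools named above: it cuts the output at the first end-of-turn marker and trims surrounding whitespace.

```python
END_MARKERS = ("<|end|>", "<|endoftext|>")


def strip_end_marker(text: str) -> str:
    """Cut the text at the first end-of-turn marker and trim whitespace."""
    cut = len(text)
    for marker in END_MARKERS:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()


print(strip_end_marker("The capital of France is Paris.\n<|end|>"))
```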

Architecture (LLaMA-compatible)

Component	Value
Parameters	1.27B
Layers	24
Hidden size	2048
Attention	GQA (32Q / 8KV heads)
Head dim	64
FFN	SwiGLU, intermediate 5504
Normalization	RMSNorm (pre-norm)
Position	RoPE (theta=10000)
Context	1024 tokens
Vocab	50,257 (GPT-2 BPE)
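
As a sanity check, the parameter count can be estimated from the values in the table. The sketch assumes a standard LLaMA-style layout: no bias terms, two RMSNorm weight vectors per layer plus a final norm, and untied input and output embeddings. The card does not state whether the embeddings are tied, so that part is an assumption.

```python
# Values from the architecture table.
VOCAB = 50257
HIDDEN = 2048
LAYERS = 24
Q_HEADS = 32
KV_HEADS = 8
HEAD_DIM = 64
FFN_DIM = 5504


def count_parameters(tied_embeddings: bool = False) -> int:
    """Estimate the parameter count of a bias-free LLaMA-style decoder."""
    q_proj = HIDDEN * Q_HEADS * HEAD_DIM
    kv_proj = 2 * HIDDEN * KV_HEADS * HEAD_DIM  # K and V are smaller with GQA
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN
    ffn = 3 * HIDDEN * FFN_DIM  # SwiGLU: gate, up and down projections
    norms = 2 * HIDDEN  # two RMSNorm weight vectors per layer
    per_layer = q_proj + kv_proj + o_proj + ffn + norms

    embeddings = VOCAB * HIDDEN
    lm_head = 0 if tied_embeddings else VOCAB * HIDDEN
    final_norm = HIDDEN
    return LAYERS * per_layer + embeddings + lm_head + final_norm


TOTAL = count_parameters()
print(f"{TOTAL:,} parameters (~{TOTAL / 1e9:.2f}B)")
```

With untied embeddings the estimate comes to about 1.27B, in line with the table; with tied embeddings it would be about 1.17B.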

Training

Pre-training

Dataset: SlimPajama-6B (streaming)
Tokens seen: 1.64B
Hardware: 2x NVIDIA L40S (96GB VRAM total)
Precision: bfloat16
Framework: JAX + Flax NNX (trained from scratch, no base model)

SFT

Datasets: Databricks Dolly-15K + OpenAssistant OASST1
Method: LoRA (rank=16, alpha=32), then merged into base weights
End-of-turn: <|end|> (5 BPE tokens) followed by <|endoftext|> (token 50256, the GPT-2 EOS)
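
Merging a LoRA adapter into the base weights follows the standard LoRA update W' = W + (alpha / rank) · B·A. The pure-Python sketch below shows that formula on a toy 2×2 matrix. It illustrates the general technique, not the training code used for this model, and the toy adapter uses rank 1 with alpha 2 so that the scale matches the alpha / rank = 32 / 16 = 2 ratio listed above.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, rank, alpha):
    """Return W + (alpha / rank) * (B @ A), the merged weight matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: 2x2 base weight, rank-1 adapter (B is 2x1, A is 1x2).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[3.0, 4.0]]

MERGED = merge_lora(W, A, B, rank=1, alpha=2)
print(MERGED)
```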

EOS fine-tune (May 2026)

Goal: teach the model to emit <|endoftext|> (50256) right after <|end|> so any framework with single-token EOS support (GGUF, MLX, Transformers) can stop natively.
Method: 200 extra LoRA steps on the existing SFT corpus with <|endoftext|> appended after each <|end|> boundary.
Result: 5/5 MLX 8-bit and llama.cpp tests stop on EOS without stop_strings workarounds.
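
The data preparation for this step amounts to a string rewrite over the SFT corpus. The sketch below is an assumed reconstruction, not the actual preprocessing script: it appends `<|endoftext|>` after every `<|end|>` marker and is idempotent, so examples that already carry the EOS token are left unchanged.

```python
END = "<|end|>"
EOS = "<|endoftext|>"


def append_eos(text: str) -> str:
    """Ensure every END marker is followed by exactly one EOS token."""
    # Collapse existing END+EOS pairs first so repeated calls add nothing.
    return text.replace(END + EOS, END).replace(END, END + EOS)


sample = "<|user|>\nHi\n<|assistant|>\nHello!\n<|end|>"
print(append_eos(sample))
```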

Limitations

Severely undertrained (6.5% of Chinchilla-optimal) — factual accuracy is low
May produce repetitive or inaccurate responses
English only
1024-token context window
Educational / research project — not production-ready

Lessons learned (for a potential v0.2)

Train longer: aim for 20B+ tokens (Chinchilla-optimal for 1.3B would be ~25B).
Pick a single-token end-of-turn marker from the start. We initially trained with <|end|> (5 BPE tokens), which broke single-token EOS frameworks. Patching it after the fact via a 200-step LoRA worked, but designing with a single token like <|endoftext|> would have been cleaner.
The SwiGLU + RMSNorm + GQA + RoPE architecture caused no issues: the results are consistent with the expected scaling curve for this token budget.

License

Apache 2.0
