AGILLM-3-Large v2
Continuation of AGILLM-3-Large training, restarted from a known-good checkpoint after a critical tokenizer bug was discovered.
What happened to v1?
A transformers library update (to v5.3.0) on 2026-03-11 silently broke the DeepSeek-V3.2 tokenizer's encode/decode pipeline:
Root cause: The tokenizer's Metaspace pre-tokenizer was configured to use ▁ (U+2581, SentencePiece convention) for space replacement, but the BPE vocabulary uses Ġ (U+0120, GPT-2 convention). This mismatch caused:
- Encoding: All spaces were silently dropped. "Water boils" encoded to ['Water', 'bo', 'ils'] instead of ['ĠWater', 'Ġboils'].
- Decoding: tok.decode() lost all spaces. Round-trip encode→decode of "The meaning of life" returned "Themeaningoflife".
- Training data corruption: ~3 billion tokens of training data were fed to the model without any space information, causing the model weights to degrade.
Detection: Model output went from coherent English (step 12,373,125) to garbled text with no spaces (step 12,528,061+). It took some investigation to trace the regression back to the tokenizer library update.
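A failure like this can be caught before training with a round-trip check on a few probe strings. The sketch below is illustrative and not part of this repo: `find_roundtrip_failures` and the two toy tokenizers are hypothetical names, and the toy tokenizers stand in for the real one so the example runs without downloading anything.

```python
def find_roundtrip_failures(encode, decode, probes):
    """Return the probe strings that do not survive encode -> decode."""
    return [text for text in probes if decode(encode(text)) != text]


# Toy stand-ins (hypothetical): one keeps every character, one drops spaces,
# which mimics the symptom described above.
def good_encode(text):
    return [ord(ch) for ch in text]


def broken_encode(text):
    return [ord(ch) for ch in text if ch != " "]


def decode(ids):
    return "".join(chr(i) for i in ids)


probes = ["The meaning of life", "Water boils", "nospaces"]

print(find_roundtrip_failures(good_encode, decode, probes))
print(find_roundtrip_failures(broken_encode, decode, probes))
```

With a real Hugging Face tokenizer, `encode` could be `lambda s: tok.encode(s, add_special_tokens=False)` and `decode` could be `tok.decode`. Some tokenizers normalize whitespace on decode, so exact equality may be too strict for arbitrary text; simple probes like the ones above are the safer choice.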
Fix: Pinned transformers==4.48.0 (+ tokenizers==0.21.4), which correctly handles the Ġ space prefix. Also added a runtime fix in n.py that patches the ▁→Ġ mismatch if detected.
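The mismatch itself can be spotted by inspecting a parsed tokenizer.json. The sketch below is not the code in n.py, whose implementation is not shown here. It assumes the usual tokenizer.json layout (a `pre_tokenizer` object with `type` and `replacement`, and a `model.vocab` mapping), and the function names are hypothetical.

```python
SP_SPACE = "\u2581"    # ▁, SentencePiece space marker
GPT2_SPACE = "\u0120"  # Ġ, GPT-2 byte-level space marker


def metaspace_replacements(pre_tok):
    """Collect replacement characters of Metaspace pre-tokenizers, including nested ones."""
    if not pre_tok:
        return []
    if pre_tok.get("type") == "Metaspace":
        return [pre_tok.get("replacement")]
    if pre_tok.get("type") == "Sequence":
        found = []
        for child in pre_tok.get("pretokenizers", []):
            found.extend(metaspace_replacements(child))
        return found
    return []


def has_space_marker_mismatch(config):
    """True if the pre-tokenizer emits ▁ while the vocabulary only knows Ġ-prefixed tokens."""
    vocab = config.get("model", {}).get("vocab", {})
    vocab_uses_gpt2 = any(tok.startswith(GPT2_SPACE) for tok in vocab)
    vocab_uses_sp = any(tok.startswith(SP_SPACE) for tok in vocab)
    emits_sp = SP_SPACE in metaspace_replacements(config.get("pre_tokenizer"))
    return emits_sp and vocab_uses_gpt2 and not vocab_uses_sp


# Minimal hand-written configs for illustration only.
broken = {
    "pre_tokenizer": {"type": "Metaspace", "replacement": SP_SPACE},
    "model": {"type": "BPE", "vocab": {"Water": 0, GPT2_SPACE + "Water": 1}},
}
healthy = {
    "pre_tokenizer": {"type": "ByteLevel"},
    "model": {"type": "BPE", "vocab": {"Water": 0, GPT2_SPACE + "Water": 1}},
}
print(has_space_marker_mismatch(broken), has_space_marker_mismatch(healthy))
```

On a real checkout, `config` would come from `json.load` on the tokenizer.json file. How the detected mismatch is then patched is left out, since that depends on the tokenizer object in use.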
This repo
Resumes training from step 12,373,125 (~10.83B tokens, 30.9%) — the last checkpoint with correctly-encoded training data.
Model
| Parameter | Value |
|---|---|
| Parameters | 698M |
| Hidden dim | 1024 |
| Layers | 24 |
| Heads | 16 |
| Rank | 128 |
| Expansion ratio | 2.0x |
| Vocab | 128,815 (DeepSeek-V3.2 tokenizer) |
| Architecture | Joint AR + SAT (autoregressive + span-aware transformer) |
| Training target | 35B tokens |
| Tokens seen (at restart) | ~10.83B (30.9%) |
Training setup
- GPU: RTX 4090 (Vast.ai, ~$0.27/hr)
- Speed: ~20,000 tok/s
- Block size: 1122
- Batch size: 1
- Mixed precision: AMP (BF16)
- Optimizer: AdamW (LR core=5e-5, LR head=2e-4)
- Data: Streamed from multiple HuggingFace datasets (web crawl, cleaned text)
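For a rough sense of what remains, the figures above can be combined into a back-of-the-envelope estimate. This sketch assumes throughput and price stay constant at the listed values, which real runs rarely do; the function name is hypothetical.

```python
TARGET_TOKENS = 35e9        # training target
TOKENS_SEEN = 10.83e9       # tokens seen at restart
TOKENS_PER_SECOND = 20_000  # listed training speed
USD_PER_HOUR = 0.27         # listed GPU price


def remaining_estimate(target, seen, tok_per_s, usd_per_hour):
    """Estimate remaining tokens, wall-clock time, and cost at constant speed and price."""
    remaining = target - seen
    hours = remaining / tok_per_s / 3600
    return {
        "fraction_done": seen / target,
        "remaining_tokens": remaining,
        "hours": hours,
        "days": hours / 24,
        "usd": hours * usd_per_hour,
    }


est = remaining_estimate(TARGET_TOKENS, TOKENS_SEEN, TOKENS_PER_SECOND, USD_PER_HOUR)
for key, value in est.items():
    print(f"{key}: {value:,.2f}")
```

With these inputs the estimate comes to roughly 336 GPU-hours (about 14 days) and about $91 for the remaining ~24.17B tokens.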
Important: Tokenizer compatibility
This model requires transformers<=4.48.0 for correct tokenizer behavior:
pip install transformers==4.48.0 tokenizers==0.21.4
Or use the runtime fix in n.py which auto-patches the ▁/Ġ mismatch.
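A startup guard can make the version requirement hard to miss. The sketch below is a hypothetical helper, not something shipped in this repo. It uses only the standard library and compares the numeric part of the installed transformers version against the 4.48.0 ceiling stated above.

```python
from importlib.metadata import PackageNotFoundError, version

MAX_TRANSFORMERS = (4, 48, 0)  # ceiling stated in this README


def parse_version(text):
    """Parse the leading 'major.minor.patch' digits, ignoring suffixes like '.dev0'."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def transformers_is_supported(installed):
    """True if the installed version is at or below the supported ceiling."""
    return parse_version(installed) <= MAX_TRANSFORMERS


def check_environment():
    """Return a short human-readable status for the local environment."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    if transformers_is_supported(installed):
        return f"transformers {installed} is within the supported range"
    return f"transformers {installed} is newer than 4.48.0; tokenizer output may lose spaces"


print(check_environment())
```

The version parsing is deliberately simple and treats pre-release suffixes as equal to the base version; a stricter check could use the `packaging` library instead.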
Sample output (step 12,373,125)
Prompt: "Water boils at one hundred degrees"
Water boils at one hundred degrees in the 1990s, and a year after that. "It's not just that, but it makes you think." He said: "I don't think I'm going to make a deal for the rest of my life."
Inference examples
This repository currently stores raw PyTorch training checkpoints and delta checkpoints. It is not a drop-in transformers.pipeline() model yet. Use the AGILLM training/inference script (nB300.py) with the pinned tokenizer stack above.
Download the newest checkpoint artifact
from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "MarxistLeninist/AGILLM-3-large-v2"

def step_number(path: str) -> int:
    # Extract the integer step from names like "pretrain_delta_step27775678.pt".
    return int(path.rsplit("step", 1)[1].split(".", 1)[0])

# Keep only the weight-only delta checkpoints.
checkpoint_files = [
    path for path in list_repo_files(repo_id)
    if path.startswith("pretrain_delta_step") and path.endswith(".pt")
]
latest = max(checkpoint_files, key=step_number)  # highest step number = newest
ckpt_path = hf_hub_download(repo_id=repo_id, filename=latest)
print(ckpt_path)
If a full pretrain_step*.pt checkpoint is available, prefer it for optimizer-preserving resumes. Delta checkpoints are weight-only and are fine for text generation.
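The preference described above can be expressed as a small selection function over the repo's file list. This is a sketch under the assumption that checkpoint names follow the two patterns mentioned in this README (`pretrain_step<N>.pt` and `pretrain_delta_step<N>.pt`); `pick_checkpoint` is a hypothetical helper.

```python
import re

# Matches both full and delta checkpoint names; group 1 is set only for deltas.
CKPT_RE = re.compile(r"^pretrain_(delta_)?step(\d+)\.pt$")


def pick_checkpoint(files, for_resume):
    """Pick a checkpoint name, or None if the list contains no checkpoints.

    for_resume=True: newest full checkpoint if any exist, else newest delta.
    for_resume=False: newest checkpoint of either kind (enough for generation).
    """
    full, delta = [], []
    for name in files:
        match = CKPT_RE.match(name)
        if not match:
            continue
        target = delta if match.group(1) else full
        target.append((int(match.group(2)), name))
    if for_resume and full:
        return max(full)[1]
    candidates = full + delta
    return max(candidates)[1] if candidates else None


files = [
    "README.md",
    "pretrain_step100.pt",
    "pretrain_delta_step150.pt",
    "pretrain_delta_step200.pt",
]
print(pick_checkpoint(files, for_resume=True))
print(pick_checkpoint(files, for_resume=False))
```

The list passed in could be the output of `list_repo_files(repo_id)`. Falling back to a delta for a resume means the optimizer state starts fresh, so that path is a compromise rather than an equivalent.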
Greedy factual completion
python nB300.py infer \
--mode ar \
--ckpt ./pretrain_delta_step27775678.pt \
--prompt "Water boils at one hundred degrees" \
--max_new 120 \
--greedy \
--plain-output
Nucleus-sampled continuation
python nB300.py infer \
--mode ar \
--ckpt ./pretrain_delta_step27775678.pt \
--prompt "The history of machine learning began with" \
--max_new 180 \
--temperature 0.7 \
--top_p 0.9 \
--repetition_penalty 1.3 \
--frequency_penalty 0.3 \
--plain-output
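For readers unfamiliar with the --temperature and --top_p flags used above, the following is a generic sketch of temperature scaling followed by nucleus (top-p) filtering. It illustrates the standard technique in plain Python and is not taken from nB300.py, whose implementation may differ; the function name is hypothetical.

```python
import math


def top_p_filter(logits, temperature=0.7, top_p=0.9):
    """Return {token_index: probability} for the smallest set of tokens
    whose cumulative probability reaches top_p, renormalized to sum to 1."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least likely until the nucleus is full.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


nucleus = top_p_filter([2.0, 1.0, 0.0, -5.0], temperature=1.0, top_p=0.9)
print(sorted(nucleus))
```

A sampler would then draw the next token from the returned distribution, for example with `random.choices(list(nucleus), weights=list(nucleus.values()))`.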
SAT mode creative continuation
python nB300.py infer \
--mode sat \
--ckpt ./pretrain_delta_step27775678.pt \
--prompt "In a quiet orbital station above Jupiter," \
--max_new 180 \
--temperature 0.5 \
--top_k 30 \
--top_p 0.9 \
--presence_penalty 0.6 \
--frequency_penalty 1.0 \
--plain-output
Prompt examples to try
| Use case | Prompt |
|---|---|
| Short factual completion | The capital of France is |
| Scientific prose | Photosynthesis is the process by which |
| Narrative continuation | The old radio began speaking just after midnight, and |
| Instruction-style completion | Write a concise explanation of gradient descent: |
| Dialogue continuation | User: What causes rain?\nAssistant: |
The model is still in pretraining, so outputs should be treated as experimental continuations rather than instruction-tuned answers.
License
Apache 2.0