AGILLM-3-Large v2

Continuation of AGILLM-3-Large training, restarted from a known-good checkpoint after a critical tokenizer bug was discovered.

What happened to v1?

A transformers library update (to v5.3.0) on 2026-03-11 silently broke the DeepSeek-V3.2 tokenizer's encode/decode pipeline:

Root cause: The tokenizer's Metaspace pre-tokenizer was configured to use ▁ (U+2581, SentencePiece convention) for space replacement, but the BPE vocabulary uses Ġ (U+0120, GPT-2 convention). This mismatch caused:
- Encoding: All spaces were silently dropped. "Water boils" encoded to ['Water', 'bo', 'ils'] instead of ['ĠWater', 'Ġboils']
- Decoding: tok.decode() lost all spaces. Round-trip encode→decode of "The meaning of life" returned "Themeaningoflife"
- Training data corruption: ~3 billion tokens of training data were fed to the model without any space information, causing the model weights to degrade
Detection: Model output went from coherent English (step 12,373,125) to space-less garbled text (step 12,528,061+). It took some investigation to trace the regression back to the tokenizer library update.
Fix: Pinned transformers==4.48.0 (+ tokenizers==0.21.4), which correctly handles the Ġ space prefix. Also added a runtime fix in n.py that patches the ▁→Ġ mismatch if detected.
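The failure mode described above can be caught with a cheap sanity check before any data reaches the model. The following is a minimal sketch and not the patch that lives in n.py: the helper names (`dominant_space_marker`, `space_marker_mismatch`, `round_trip_ok`) are hypothetical, and the tokenizer is represented only by its vocabulary strings and by plain encode/decode callables.

```python
from typing import Callable, Iterable, List

SENTENCEPIECE_SPACE = "\u2581"  # the SentencePiece space marker
GPT2_SPACE = "\u0120"           # the GPT-2 byte-level space marker


def dominant_space_marker(vocab: Iterable[str]) -> str:
    """Return whichever space marker prefixes more vocabulary entries.

    Ties (including an empty vocabulary) fall back to the GPT-2 marker.
    """
    sentencepiece_count = 0
    gpt2_count = 0
    for token in vocab:
        if token.startswith(SENTENCEPIECE_SPACE):
            sentencepiece_count += 1
        elif token.startswith(GPT2_SPACE):
            gpt2_count += 1
    if sentencepiece_count > gpt2_count:
        return SENTENCEPIECE_SPACE
    return GPT2_SPACE


def space_marker_mismatch(vocab: Iterable[str], pretokenizer_marker: str) -> bool:
    """True when the pre-tokenizer emits a marker the vocabulary does not use."""
    return dominant_space_marker(vocab) != pretokenizer_marker


def round_trip_ok(
    encode: Callable[[str], List],
    decode: Callable[[List], str],
    text: str,
) -> bool:
    """True when decode(encode(text)) reproduces the text exactly."""
    return decode(encode(text)) == text
```

With a loaded Hugging Face tokenizer `tok`, one plausible wiring is `round_trip_ok(lambda s: tok.encode(s, add_special_tokens=False), tok.decode, "The meaning of life")`. Exact equality may need loosening for tokenizers that normalize text on decode.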

This repo

Resumes training from step 12,373,125 (~10.83B tokens, 30.9%) — the last checkpoint with correctly-encoded training data.

Model

Parameter	Value
Parameters	698M
Hidden dim	1024
Layers	24
Heads	16
Rank	128
Expansion ratio	2.0x
Vocab	128,815 (DeepSeek-V3.2 tokenizer)
Architecture	Joint AR + SAT (autoregressive + span-aware transformer)
Training target	35B tokens
Tokens seen (at restart)	~10.83B (30.9%)

Training setup

GPU: RTX 4090 (Vast.ai, ~$0.27/hr)
Speed: ~20,000 tok/s
Block size: 1122
Batch size: 1
Mixed precision: AMP (BF16)
Optimizer: AdamW (LR core=5e-5, LR head=2e-4)
Data: Streamed from multiple HuggingFace datasets (web crawl, cleaned text)
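The figures above imply a rough estimate of the remaining training time and cost. The sketch below only does that arithmetic. It assumes the ~20,000 tok/s throughput and the ~$0.27/hr price stay constant for the rest of the run, which is an assumption and not something measured here; `remaining_budget` is a hypothetical helper name.

```python
TARGET_TOKENS = 35e9          # training target
TOKENS_SEEN = 10.83e9         # tokens seen at restart (approximate)
STEPS_AT_RESTART = 12_373_125
TOKENS_PER_SECOND = 20_000    # approximate throughput on the RTX 4090
USD_PER_HOUR = 0.27           # approximate Vast.ai price


def remaining_budget(target, seen, tokens_per_second, usd_per_hour):
    """Return (remaining_tokens, hours, usd), assuming constant speed and price."""
    remaining = target - seen
    hours = remaining / tokens_per_second / 3600
    return remaining, hours, hours * usd_per_hour


remaining, hours, usd = remaining_budget(
    TARGET_TOKENS, TOKENS_SEEN, TOKENS_PER_SECOND, USD_PER_HOUR
)
tokens_per_step = TOKENS_SEEN / STEPS_AT_RESTART

print(f"fraction done:    {TOKENS_SEEN / TARGET_TOKENS:.1%}")
print(f"tokens per step:  {tokens_per_step:.0f}")
print(f"remaining tokens: {remaining / 1e9:.2f}B")
print(f"remaining time:   {hours:.0f} h ({hours / 24:.1f} days)")
print(f"remaining cost:   ${usd:.0f}")
```

Under those assumptions the remaining ~24.17B tokens work out to roughly 336 GPU-hours (about 14 days) and about $91, with an average of about 875 tokens per step so far.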

Important: Tokenizer compatibility

This model requires transformers<=4.48.0 for correct tokenizer behavior:

pip install transformers==4.48.0 tokenizers==0.21.4

Alternatively, use the runtime fix in n.py, which auto-patches the ▁/Ġ mismatch when it is detected.
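A script can also refuse to start when the installed library is newer than the pinned version. This is a minimal sketch under the `transformers<=4.48.0` constraint stated above. The function names are hypothetical, and the version parsing is deliberately simple (leading numeric components only) so that the example has no dependency beyond the standard library.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MAX_SUPPORTED = (4, 48, 0)  # from the pin above


def parse_release(version_string: str) -> tuple:
    """Leading numeric components of a version, e.g. '4.48.0' -> (4, 48, 0)."""
    parts = []
    for piece in version_string.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def tokenizer_stack_supported(installed: str, max_supported: tuple = MAX_SUPPORTED) -> bool:
    """True when the installed release is at or below the supported maximum."""
    release = parse_release(installed)
    return bool(release) and release <= max_supported


def check_installed(package: str = "transformers") -> str:
    """Raise RuntimeError if the package is missing or too new; else return its version."""
    try:
        installed = version(package)
    except PackageNotFoundError as exc:
        raise RuntimeError(f"{package} is not installed") from exc
    if not tokenizer_stack_supported(installed):
        raise RuntimeError(
            f"{package}=={installed} is newer than the supported "
            f"{'.'.join(map(str, MAX_SUPPORTED))}; tokenizer output may drop spaces"
        )
    return installed
```

Calling `check_installed()` at the top of a training or inference script turns a silent tokenizer regression into an immediate error.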

Sample output (step 12,373,125)

Prompt: "Water boils at one hundred degrees"

Water boils at one hundred degrees in the 1990s, and a year after that. "It's not just that, but it makes you think." He said: "I don't think I'm going to make a deal for the rest of my life."

Inference examples

This repository currently stores raw PyTorch training checkpoints and delta checkpoints. It is not a drop-in transformers.pipeline() model yet. Use the AGILLM training/inference script (nB300.py) with the pinned tokenizer stack above.

Download the newest checkpoint artifact

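from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "MarxistLeninist/AGILLM-3-large-v2"

def step_number(path: str) -> int:
    # "pretrain_delta_step27775678.pt" -> 27775678
    return int(path.rsplit("step", 1)[1].split(".", 1)[0])

# Collect every delta checkpoint stored at the repo root.
checkpoint_files = [
    path for path in list_repo_files(repo_id)
    if path.startswith("pretrain_delta_step") and path.endswith(".pt")
]
if not checkpoint_files:
    raise FileNotFoundError(f"No pretrain_delta_step*.pt files found in {repo_id}")

# Pick the highest step and download it into the local Hugging Face cache.
latest = max(checkpoint_files, key=step_number)
ckpt_path = hf_hub_download(repo_id=repo_id, filename=latest)
print(ckpt_path)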

If a full pretrain_step*.pt checkpoint is available, prefer it for optimizer-preserving resumes. Delta checkpoints are weight-only and are fine for text generation.
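The preference described above can be expressed as a small selection function over the repository's file list. This is a sketch under two assumptions: checkpoint files sit at the repo root, and they are named exactly `pretrain_step<N>.pt` (full) or `pretrain_delta_step<N>.pt` (delta). `pick_checkpoint` is a hypothetical helper, not part of the AGILLM scripts.

```python
import re
from typing import Iterable, Optional

# Group 1 is "delta_" for delta checkpoints, None for full ones; group 2 is the step.
_CHECKPOINT_RE = re.compile(r"^pretrain_(delta_)?step(\d+)\.pt$")


def pick_checkpoint(files: Iterable[str], for_resume: bool) -> Optional[str]:
    """Choose a checkpoint filename, or None if nothing suitable exists.

    for_resume=True  -> newest full checkpoint only (optimizer state is needed).
    for_resume=False -> newest checkpoint of either kind (weights are enough).
    """
    full, delta = [], []
    for name in files:
        match = _CHECKPOINT_RE.match(name)
        if match is None:
            continue
        bucket = delta if match.group(1) else full
        bucket.append((int(match.group(2)), name))

    candidates = full if for_resume else full + delta
    if not candidates:
        return None
    # Tuples sort by step number first, so max() yields the latest step.
    return max(candidates)[1]
```

Pass the result of `list_repo_files(repo_id)` as `files`, then hand the returned name to `hf_hub_download`. Returning `None` for a resume when only delta files exist is a design choice here: it forces the caller to decide explicitly whether restarting without optimizer state is acceptable.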

Greedy factual completion

python nB300.py infer \
  --mode ar \
  --ckpt ./pretrain_delta_step27775678.pt \
  --prompt "Water boils at one hundred degrees" \
  --max_new 120 \
  --greedy \
  --plain-output

Nucleus-sampled continuation

python nB300.py infer \
  --mode ar \
  --ckpt ./pretrain_delta_step27775678.pt \
  --prompt "The history of machine learning began with" \
  --max_new 180 \
  --temperature 0.7 \
  --top_p 0.9 \
  --repetition_penalty 1.3 \
  --frequency_penalty 0.3 \
  --plain-output
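For readers unfamiliar with the `--temperature` and `--top_p` flags, the sketch below shows the conventional meaning of temperature scaling followed by nucleus (top-p) filtering on a toy distribution. It illustrates the general technique only. How nB300.py implements these flags is not shown here, and `top_p_filter` is a hypothetical name.

```python
import math
from typing import Dict


def top_p_filter(logits: Dict[str, float], temperature: float, top_p: float) -> Dict[str, float]:
    """Return renormalized probabilities over the smallest set of tokens
    whose cumulative probability reaches top_p."""
    # Temperature scaling, then a numerically stable softmax.
    scaled = {token: logit / temperature for token, logit in logits.items()}
    peak = max(scaled.values())
    weights = {token: math.exp(value - peak) for token, value in scaled.items()}
    total = sum(weights.values())

    # Walk tokens from most to least probable until the nucleus is full.
    ranked = sorted(((w / total, token) for token, w in weights.items()), reverse=True)
    kept = []
    cumulative = 0.0
    for probability, token in ranked:
        kept.append((token, probability))
        cumulative += probability
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    mass = sum(probability for _, probability in kept)
    return {token: probability / mass for token, probability in kept}
```

A sampler would then draw the next token from the returned distribution, for example with `random.choices(list(d), weights=list(d.values()))`. Lower temperatures sharpen the distribution before filtering, so fewer tokens survive the same `top_p`.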

SAT mode creative continuation

python nB300.py infer \
  --mode sat \
  --ckpt ./pretrain_delta_step27775678.pt \
  --prompt "In a quiet orbital station above Jupiter," \
  --max_new 180 \
  --temperature 0.5 \
  --top_k 30 \
  --top_p 0.9 \
  --presence_penalty 0.6 \
  --frequency_penalty 1.0 \
  --plain-output

Prompt examples to try

Use case	Prompt
Short factual completion	`The capital of France is`
Scientific prose	`Photosynthesis is the process by which`
Narrative continuation	`The old radio began speaking just after midnight, and`
Instruction-style completion	`Write a concise explanation of gradient descent:`
Dialogue continuation	`User: What causes rain?\nAssistant:`

The model is still in pretraining, so outputs should be treated as experimental continuations rather than instruction-tuned answers.

License

Apache 2.0
