AGILLM-3-Large v2

Continuation of AGILLM-3-Large training, restarted from a known-good checkpoint after a critical tokenizer bug was discovered.

What happened to v1?

A transformers library update (to v5.3.0) on 2026-03-11 silently broke the DeepSeek-V3.2 tokenizer's encode/decode pipeline:

  • Root cause: The tokenizer's Metaspace pre-tokenizer was configured to use ▁ (U+2581, SentencePiece convention) for space replacement, but the BPE vocabulary uses Ġ (U+0120, GPT-2 convention). This mismatch caused:

    • Encoding: All spaces were silently dropped. "Water boils" encoded to ['Water', 'bo', 'ils'] instead of ['ĠWater', 'Ġboils']
    • Decoding: tok.decode() lost all spaces. Round-trip encode→decode of "The meaning of life" returned "Themeaningoflife"
    • Training data corruption: ~3 billion tokens of training data were fed to the model without any space information, causing the model weights to degrade
  • Detection: Model output went from coherent English (step 12,373,125) to garbled text without spaces (step 12,528,061 onward). It took some investigation to trace the problem back to the tokenizer library update.

  • Fix: Pinned transformers==4.48.0 (+ tokenizers==0.21.4), which correctly handles the Ġ space prefix. Also added a runtime fix in n.py that patches the ▁→Ġ mismatch if detected.
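The failure described above is the kind a round-trip check can catch before any training data is encoded. The following is a minimal sketch, not the check used in this repo: `round_trip_failures` and the toy `broken_encode` are hypothetical names, and the toy codec only imitates the symptom (spaces dropped during encoding).

```python
def round_trip_failures(encode, decode, samples):
    """Return (original, restored) pairs where decode(encode(text)) != text."""
    failures = []
    for text in samples:
        restored = decode(encode(text))
        if restored != text:
            failures.append((text, restored))
    return failures


# Toy stand-ins for a tokenizer (hypothetical, character-level).
def healthy_encode(text):
    return list(text)


def broken_encode(text):
    # Imitates the reported bug: every space is silently dropped.
    return [ch for ch in text if ch != " "]


def toy_decode(tokens):
    return "".join(tokens)


samples = ["The meaning of life", "Water boils"]
ok = round_trip_failures(healthy_encode, toy_decode, samples)
bad = round_trip_failures(broken_encode, toy_decode, samples)
```

With a real Hugging Face tokenizer `tok`, one would presumably pass `lambda s: tok.encode(s, add_special_tokens=False)` and `tok.decode` instead of the toy functions. Some tokenizers normalize text, so an exact-match comparison may need loosening; the check for lost spaces is the part that matters here.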

This repo

Resumes training from step 12,373,125 (~10.83B tokens, 30.9%) — the last checkpoint with correctly-encoded training data.

Model

| Parameter | Value |
|---|---|
| Parameters | 698M |
| Hidden dim | 1024 |
| Layers | 24 |
| Heads | 16 |
| Rank | 128 |
| Expansion ratio | 2.0x |
| Vocab | 128,815 (DeepSeek-V3.2 tokenizer) |
| Architecture | Joint AR + SAT (autoregressive + span-aware transformer) |
| Training target | 35B tokens |
| Tokens seen (at restart) | ~10.83B (30.9%) |

Training setup

  • GPU: RTX 4090 (Vast.ai, ~$0.27/hr)
  • Speed: ~20,000 tok/s
  • Block size: 1122
  • Batch size: 1
  • Mixed precision: AMP (BF16)
  • Optimizer: AdamW (LR core=5e-5, LR head=2e-4)
  • Data: Streamed from multiple HuggingFace datasets (web crawl, cleaned text)
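As a back-of-the-envelope check on these figures, the sketch below works out the progress percentage and a rough estimate of what remains. It assumes throughput and hourly price stay constant at the values listed, which is unlikely to hold exactly; `remaining_budget` is a hypothetical helper name.

```python
TARGET_TOKENS = 35e9          # training target
TOKENS_SEEN = 10.83e9         # tokens seen at restart
TOKENS_PER_SECOND = 20_000    # approximate throughput
PRICE_PER_HOUR_USD = 0.27     # approximate rental price


def remaining_budget(target, seen, tok_per_s, usd_per_hour):
    """Estimate remaining tokens, wall-clock time, and cost at constant speed."""
    remaining = target - seen
    hours = remaining / tok_per_s / 3600
    return {
        "progress": seen / target,
        "remaining_tokens": remaining,
        "hours": hours,
        "days": hours / 24,
        "usd": hours * usd_per_hour,
    }


est = remaining_budget(TARGET_TOKENS, TOKENS_SEEN,
                       TOKENS_PER_SECOND, PRICE_PER_HOUR_USD)
print(f"{est['progress']:.1%} done, ~{est['hours']:.0f} h "
      f"(~{est['days']:.0f} days), ~${est['usd']:.0f} remaining")
```

Under those assumptions, the remaining 24.17B tokens come to roughly 336 GPU-hours, or about 14 days, and on the order of $90.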

Important: Tokenizer compatibility

This model requires transformers<=4.48.0 for correct tokenizer behavior:

pip install transformers==4.48.0 tokenizers==0.21.4

Or use the runtime fix in n.py which auto-patches the ▁/Ġ mismatch.
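The runtime fix itself is not reproduced here. As a rough illustration of how such a mismatch could be detected, the sketch below counts which space marker the vocabulary actually uses and compares it with the marker the pre-tokenizer is configured to emit. The function names are hypothetical, and how the pre-tokenizer's marker is read from a real tokenizer object is deliberately left out.

```python
SP_MARK = "\u2581"    # ▁, SentencePiece convention
GPT2_MARK = "\u0120"  # Ġ, GPT-2 byte-level BPE convention


def dominant_space_marker(vocab):
    """Return the marker that prefixes more vocabulary entries, or None on a tie."""
    counts = {SP_MARK: 0, GPT2_MARK: 0}
    for token in vocab:
        for mark in counts:
            if token.startswith(mark):
                counts[mark] += 1
    if counts[SP_MARK] == counts[GPT2_MARK]:
        return None  # neither marker dominates (including "neither appears")
    return max(counts, key=counts.get)


def has_marker_mismatch(vocab, pretokenizer_marker):
    """True when the vocabulary clearly uses a different marker than the pre-tokenizer."""
    dominant = dominant_space_marker(vocab)
    return dominant is not None and dominant != pretokenizer_marker


# Tiny illustrative vocabulary (hypothetical entries, GPT-2 style).
toy_vocab = {"ĠWater": 0, "Ġboils": 1, "bo": 2, "ils": 3}
mismatch = has_marker_mismatch(toy_vocab, SP_MARK)
```

For a loaded Hugging Face tokenizer, `tok.get_vocab()` returns a token-to-id dict that can be passed as `vocab`.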

Sample output (step 12,373,125)

Prompt: "Water boils at one hundred degrees"

Water boils at one hundred degrees in the 1990s, and a year after that. "It's not just that, but it makes you think." He said: "I don't think I'm going to make a deal for the rest of my life."

Inference examples

This repository currently stores raw PyTorch training checkpoints and delta checkpoints. It is not a drop-in transformers.pipeline() model yet. Use the AGILLM training/inference script (nB300.py) with the pinned tokenizer stack above.

Download the newest checkpoint artifact

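from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "MarxistLeninist/AGILLM-3-large-v2"

def step_number(path: str) -> int:
    # "pretrain_delta_step27775678.pt" -> 27775678
    return int(path.rsplit("step", 1)[1].split(".", 1)[0])

# Collect the weight-only delta checkpoints stored at the repo root.
checkpoint_files = [
    path for path in list_repo_files(repo_id)
    if path.startswith("pretrain_delta_step") and path.endswith(".pt")
]
if not checkpoint_files:
    raise FileNotFoundError(f"No pretrain_delta_step*.pt files found in {repo_id}")

# Pick the highest step and download it into the local Hugging Face cache.
latest = max(checkpoint_files, key=step_number)
ckpt_path = hf_hub_download(repo_id=repo_id, filename=latest)
print(ckpt_path)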

If a full pretrain_step*.pt checkpoint is available, prefer it for optimizer-preserving resumes. Delta checkpoints are weight-only and are fine for text generation.
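The preference described above can be expressed as a small selection rule over the repo's file list. This is a sketch under the assumption that checkpoints sit at the repo root and follow the `pretrain_step*.pt` / `pretrain_delta_step*.pt` naming; `pick_checkpoint` is a hypothetical helper, and the filenames in the example are made up.

```python
import re

# Matches "pretrain_step123.pt" (full) and "pretrain_delta_step123.pt" (delta).
CKPT_RE = re.compile(r"^pretrain_(delta_)?step(\d+)\.pt$")


def pick_checkpoint(files, need_optimizer):
    """Return the best checkpoint filename, or None if nothing suitable exists.

    need_optimizer=True  -> only full checkpoints qualify (resuming training).
    need_optimizer=False -> the highest step of either kind (text generation).
    """
    candidates = []
    for name in files:
        match = CKPT_RE.match(name)
        if match is None:
            continue
        is_delta = match.group(1) is not None
        if need_optimizer and is_delta:
            continue  # delta checkpoints carry weights only
        # Sort by step; on equal steps a full checkpoint wins over a delta.
        candidates.append((int(match.group(2)), not is_delta, name))
    if not candidates:
        return None
    return max(candidates)[2]


# Hypothetical listing, e.g. the result of list_repo_files(repo_id).
files = ["README.md", "pretrain_step100.pt",
         "pretrain_delta_step150.pt", "pretrain_delta_step200.pt"]
for_resume = pick_checkpoint(files, need_optimizer=True)
for_generation = pick_checkpoint(files, need_optimizer=False)
```

The returned name can then be passed as `filename` to `hf_hub_download`.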

Greedy factual completion

python nB300.py infer \
  --mode ar \
  --ckpt ./pretrain_delta_step27775678.pt \
  --prompt "Water boils at one hundred degrees" \
  --max_new 120 \
  --greedy \
  --plain-output

Nucleus-sampled continuation

python nB300.py infer \
  --mode ar \
  --ckpt ./pretrain_delta_step27775678.pt \
  --prompt "The history of machine learning began with" \
  --max_new 180 \
  --temperature 0.7 \
  --top_p 0.9 \
  --repetition_penalty 1.3 \
  --frequency_penalty 0.3 \
  --plain-output
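
For readers unfamiliar with the sampling flags, the sketch below shows what temperature scaling and nucleus (top-p) filtering conventionally do to a next-token distribution. It is a generic, pure-Python illustration and not the code in nB300.py, whose implementation may differ. The repetition and frequency penalties are not covered, and the function names are hypothetical.

```python
import math
import random


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose mass reaches top_p.

    Returns {token_id: renormalized_probability}.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


def sample_next(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token id after temperature scaling and nucleus filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    nucleus = top_p_filter(probs, top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9)
token = sample_next([10.0, 0.0, 0.0], temperature=0.7, top_p=0.9)
```

Lower temperatures sharpen the distribution before filtering, so the nucleus tends to shrink. With `top_p=0.9`, the least likely tail of the distribution is never sampled.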

SAT mode creative continuation

python nB300.py infer \
  --mode sat \
  --ckpt ./pretrain_delta_step27775678.pt \
  --prompt "In a quiet orbital station above Jupiter," \
  --max_new 180 \
  --temperature 0.5 \
  --top_k 30 \
  --top_p 0.9 \
  --presence_penalty 0.6 \
  --frequency_penalty 1.0 \
  --plain-output

Prompt examples to try

| Use case | Prompt |
|---|---|
| Short factual completion | The capital of France is |
| Scientific prose | Photosynthesis is the process by which |
| Narrative continuation | The old radio began speaking just after midnight, and |
| Instruction-style completion | Write a concise explanation of gradient descent: |
| Dialogue continuation | User: What causes rain?\nAssistant: |

The model is still in pretraining, so outputs should be treated as experimental continuations rather than instruction-tuned answers.
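
To try several of these prompts in a row, the CLI call can be assembled from Python. The sketch below only builds the argument list, using flags that appear in the commands above; `build_infer_command` is a hypothetical helper. It assumes the `\n` in the dialogue prompt is meant as a real newline, which a Python string expresses directly.

```python
import shlex


def build_infer_command(ckpt, prompt, mode="ar", max_new=120, greedy=True):
    """Build the argv list for one `nB300.py infer` run (flags taken from this README)."""
    cmd = [
        "python", "nB300.py", "infer",
        "--mode", mode,
        "--ckpt", ckpt,
        "--prompt", prompt,
        "--max_new", str(max_new),
    ]
    if greedy:
        cmd.append("--greedy")
    cmd.append("--plain-output")
    return cmd


# The "\n" here is an actual newline character, not a backslash followed by "n".
dialogue_prompt = "User: What causes rain?\nAssistant:"
cmd = build_infer_command("./pretrain_delta_step27775678.pt", dialogue_prompt)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting issues with prompts that contain quotes or newlines.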

License

Apache 2.0
