DwarfGoToken

A compact BPE tokenizer (8,192 tokens) designed for tiny language models that need to understand shell commands, code snippets, and ChatML-formatted conversations. It is built on a custom Go pre‑tokenizer that keeps critical shell tokens (grep, chmod, 2>&1, -rf, …) atomic, avoiding the fragmentation that lengthens token sequences and slows CPU-bound inference.

Why 8,192 tokens?

For a small LM (<20M parameters), a large vocabulary (e.g., 64K) wastes the majority of the model’s parameters on the embedding matrix. With d_model=256, the embedding here accounts for only 2.1M parameters (8,192 × 256, ~14% of the model), leaving the rest for the transformer layers, where it matters most.
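The arithmetic behind these figures can be checked with a short sketch. The 15M total is an assumption inferred from the ~14% figure (2.1M / 0.14), "64K" is taken as 64,000 entries, and the input and output embeddings are assumed to be tied so the matrix is counted once.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of weights in a vocab_size x d_model embedding matrix."""
    return vocab_size * d_model


D_MODEL = 256
TOTAL_PARAMS = 15_000_000  # assumption: inferred from "~14%", not stated in this card

small = embedding_params(8_192, D_MODEL)   # this tokenizer's vocabulary
large = embedding_params(64_000, D_MODEL)  # a hypothetical 64K vocabulary

# Share of the model spent on embeddings with the 8,192-token vocabulary.
share = small / TOTAL_PARAMS

# Same transformer layers, but with the 64K vocabulary swapped in.
other_params = TOTAL_PARAMS - small
large_share = large / (other_params + large)

print(f"8,192 tokens: {small:,} embedding params ({share:.1%} of the model)")
print(f"64,000 tokens: {large:,} embedding params ({large_share:.1%} of the model)")
```

Under these assumptions the 64K vocabulary alone would hold more weights than the entire 15M-parameter model.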

Corpus

Source                                   Domain                    Lines
bigcode/the-stack-dedup/shell            Shell                 1,500,000
bigcode/the-stack-dedup/batchfile        Batch                   500,000
bigcode/the-stack-dedup/python           Python                1,000,000
bigcode/the-stack-dedup/c                C                       500,000
m-a-p/CodeFeedback-Filtered-Instruction  Code+Instructions       200,000
HuggingFaceH4/helpful-instructions       English instructions    150,000
HuggingFaceFW/fineweb/sample-10BT        Web English             300,000
Magpie-Align/Magpie-Reasoning-150K       Chain-of-Thought        200,000

Total: 4,251,427 lines (3.5 GB): 47% shell, 40% code, 9.5% English, 3.5% chain-of-thought.

Special tokens (all atomic)

<s>, </s>, <unk>, <pad>, <|system|>, <|user|>, <|assistant|>, <|end|>, <|thinking|>, <|/thinking|>, plus 54 Go‑pre‑tokenizer tokens (e.g., grep, chmod, 2>&1, &&, >>, -rf, --help).
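To illustrate what keeping a token atomic means at the pre-tokenization stage, here is a minimal Python sketch. It is not the Go pre-tokenizer used by this project, whose rules are not described here. The ATOMIC list contains only the seven examples named above, and the boundary rule (word-like tokens must not touch a letter, digit, underscore, or hyphen) is an assumption made for the illustration.

```python
import re

# Only the examples named in this card; the real list has 54 entries.
ATOMIC = ["grep", "chmod", "2>&1", "&&", ">>", "-rf", "--help"]

_WORDLIKE = re.compile(r"[\w-]")


def _pattern(token: str) -> str:
    """Regex for one atomic token, with boundary guards on word-like edges."""
    # Assumed rule: "grep" must not match inside "egrep", but operators
    # such as "&&" may touch their neighbours ("ls&&pwd").
    left = r"(?<![\w-])" if _WORDLIKE.match(token[0]) else ""
    right = r"(?![\w-])" if _WORDLIKE.match(token[-1]) else ""
    return left + re.escape(token) + right


# Longest tokens first, so a longer entry wins over a shorter overlapping one.
_ATOMIC_RE = re.compile(
    "|".join(_pattern(t) for t in sorted(ATOMIC, key=len, reverse=True))
)


def pre_tokenize(text: str) -> list[tuple[str, bool]]:
    """Split text into (piece, is_atomic) pairs that concatenate back to text."""
    pieces = []
    pos = 0
    for match in _ATOMIC_RE.finditer(text):
        if match.start() > pos:
            pieces.append((text[pos:match.start()], False))
        pieces.append((match.group(), True))
        pos = match.end()
    if pos < len(text):
        pieces.append((text[pos:], False))
    return pieces


command = "rm -rf /tmp/x && egrep foo 2>&1"
atomic_pieces = [piece for piece, is_atomic in pre_tokenize(command) if is_atomic]
print(atomic_pieces)  # ['-rf', '&&', '2>&1']
```

In a setup like this, only the non-atomic pieces would be passed on to BPE merging, while each atomic piece would map directly to a single vocabulary entry.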

Quick test

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("ThingAI/DwarfGoToken")

# Shell commands such as find, xargs, rm and -rf stay atomic (other flags, e.g. -name, may still split)
tok.tokenize("find /var/log -name '*.gz' | xargs rm -rf")
# → ['find', '/', 'var', '/', 'log', '-n', 'ame', "'", '*.', 'gz', "'", '|', 'xargs', 'rm', '-rf']

# ChatML template (the prompt is Italian for "What does grep do?")
tok.tokenize("<|user|>\nCosa fa grep?\n<|end|>\n<|assistant|>\n...")
# → ['<|user|>', 'C', 'os', 'a', 'fa', 'grep', '?', '<|end|>', '<|assistant|>', '...']
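One way to quantify fragmentation is the average number of tokens per whitespace-separated word. The helper below is a hypothetical sketch and not part of this project. It accepts any callable that maps a string to a token list, so tok.tokenize from the Quick test could be passed in directly. To stay runnable offline, the demo uses the token list shown above for the find command as a stand-in.

```python
from typing import Callable, List


def tokens_per_word(tokenize: Callable[[str], List[str]], text: str) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)


command = "find /var/log -name '*.gz' | xargs rm -rf"

# Stand-in for tok.tokenize: the output listed in the Quick test above.
documented_tokens = ['find', '/', 'var', '/', 'log', '-n', 'ame', "'", '*.',
                     'gz', "'", '|', 'xargs', 'rm', '-rf']

ratio = tokens_per_word(lambda _text: documented_tokens, command)
print(ratio)  # 1.875 (15 tokens over 8 words)
```

Lower values mean less fragmentation. Running the same function with another tokenizer on the same commands gives a like-for-like comparison.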

Usage

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("ThingAI/DwarfGoToken")
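For chat-style prompts, the special tokens can be assembled into a string before tokenizing. The helper below is a sketch: build_prompt and its parameters are hypothetical names, and the layout (role token, newline, content, newline, <|end|>, newline) is inferred from the Quick test example rather than from a documented chat template.

```python
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}
END_TOKEN = "<|end|>"


def build_prompt(messages, add_generation_prompt=True):
    """Render (role, content) pairs in the layout assumed from the Quick test."""
    parts = []
    for role, content in messages:
        parts.append(f"{ROLE_TOKENS[role]}\n{content}\n{END_TOKEN}\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append(f"{ROLE_TOKENS['assistant']}\n")
    return "".join(parts)


prompt = build_prompt([
    ("system", "You are a shell assistant."),
    ("user", "What does grep do?"),
])
print(prompt)
```

The resulting string can then be passed to tokenizer(prompt) or tokenizer.tokenize(prompt) as in the snippets above.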

Intended use

This tokenizer was built to pair with tiny LMs (~10–20M parameters) specialised in command‑line assistance, shell scripting, or code generation. It is the companion tokenizer for the Dwarf model family by ThingAI.

License

Apache 2.0: use it, modify it, ship it.
