# DOROFEEVA-gpt-oss-tokenizer
🎵 The name is inspired by a song.
## Overview
A specialized tokenizer based on openai/gpt-oss-20b, optimized for Ukrainian-language processing. It adds 45,333 Ukrainian tokens from lapa-llm/tokenizer while keeping the original vocabulary size of 199,998 by replacing tokens from writing systems geographically and culturally distant from Ukraine.
## Key Features
- **+45,333 new Cyrillic BPE tokens:**
  - 152 base Cyrillic letters
  - 45,181 Ukrainian word tokens from lapa-llm/tokenizer
- **No removal of English or EU language tokens:** only non-essential tokens from distant writing systems were replaced
- **Latin-safe removal:** Latin tokens with diacritics used across European languages (ü, ö, ç, ã, á, é, etc.) are preserved
- **Identical specifications:** the vocab size (199,998) and byte-level BPE encoding match the original GPT-OSS 20B
## Replaced Tokens by Writing System
The tokenizer replaced tokens from writing systems with low relevance to Ukrainian:
| Writing System | Tokens removed | Tokens retained |
|---|---|---|
| Arabic | 7,144 | 818 |
| Han (Chinese) | 5,323 | 1,422 |
| Devanagari (Hindi) | 3,100 | 808 |
| Hebrew | 1,865 | 462 |
| Bengali | 1,633 | 440 |
| Hangul (Korean) | 1,562 | 336 |
| Armenian | 1,366 | 307 |
| Malayalam | 1,293 | 318 |
| Gujarati | 1,255 | 315 |
| Thai | 1,213 | 296 |
| Telugu | 1,018 | 255 |
| Kannada | 1,016 | 252 |
| Tamil | 755 | 189 |
| Japanese (Hiragana/Katakana) | 515 | 167 |
| Sinhala | 212 | 54 |
| Gurmukhi | 209 | 65 |
| Khmer | 199 | 98 |
| Myanmar | 170 | 99 |
Fully preserved:
- Latin scripts (English + European languages)
- Greek (for math/science)
- Common punctuation and emojis
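The by-script selection described above can be approximated with a small classifier over Unicode character names. A minimal sketch using Python's `unicodedata` (the actual selection logic used for this repo may differ):

```python
import unicodedata


def dominant_script(token: str) -> str:
    """Guess the writing system of a token from Unicode character names.

    Returns e.g. "CYRILLIC", "LATIN", or "ARABIC"; returns "OTHER" for
    tokens with no alphabetic characters (punctuation, digits, raw bytes).
    """
    counts: dict[str, int] = {}
    for ch in token:
        if not ch.isalpha():
            continue
        # Unicode names start with the script, e.g. "CYRILLIC SMALL LETTER YA".
        name = unicodedata.name(ch, "")
        script = name.split(" ")[0] if name else "OTHER"
        counts[script] = counts.get(script, 0) + 1
    if not counts:
        return "OTHER"
    return max(counts, key=counts.get)


print(dominant_script("привіт"))  # CYRILLIC
print(dominant_script("hello"))   # LATIN
```

A classifier like this is enough to bucket vocabulary entries into the per-script counts shown in the table above.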
## Metrics
Acknowledgement: evaluation results provided by Andrii Sameliuk
Each cell shows total tokens with tokens-per-word (fertility) in parentheses; lower is better.

| Tokenizer | lang-uk/malyuk [100k] | allenai/c4 (en) [100k] | allenai/c4 (es,fr,it,de) [100k] | QIRIM/crh (Cyrillic) [94] | allenai/c4 (ru) [100k] | allenai/c4 (bg) [100k] | allenai/c4 (be) [100k] |
|---|---|---|---|---|---|---|---|
| *Words in corpus* | 22,898,164 | 36,170,971 | 198,173,216 | 1,868,259 | 42,557,519 | 44,627,199 | 43,153,645 |
| Qwen/Qwen3-8B | 84,408,084 (3.686) | 46,884,593 (1.296) | 395,581,536 (1.996) | 7,956,741 (4.259) | 116,115,062 (2.728) | 132,597,427 (2.971) | 173,571,099 (4.022) |
| meta-llama/Llama-3.1-8B-Instruct | 57,226,997 (2.499) | 46,085,724 (1.274) | 382,143,751 (1.928) | 7,386,873 (3.954) | 104,974,733 (2.467) | 119,123,733 (2.669) | 150,189,294 (3.480) |
| microsoft/Phi-4-mini-instruct | 59,447,036 (2.596) | 45,423,925 (1.256) | 335,188,687 (1.691) | 5,995,822 (3.209) | 91,824,464 (2.158) | 102,472,523 (2.296) | 119,587,038 (2.771) |
| CohereLabs/aya-expanse-8b | 50,973,632 (2.226) | 47,364,187 (1.309) | 353,221,932 (1.782) | 6,614,719 (3.541) | 93,089,697 (2.187) | 112,612,668 (2.523) | 141,262,943 (3.273) |
| google/gemma-3-12b-it | 57,388,402 (2.506) | 47,285,432 (1.307) | 354,241,840 (1.788) | 6,240,944 (3.341) | 95,520,817 (2.245) | 103,950,626 (2.329) | 131,398,147 (3.045) |
| openai/gpt-oss-20b | 59,447,036 (2.596) | 45,423,925 (1.256) | 335,188,687 (1.691) | 5,995,822 (3.209) | 91,824,464 (2.158) | 102,472,523 (2.296) | 119,587,038 (2.771) |
| DOROFEEVA-gpt-oss (Ours) | 37,679,507 (1.646 🤩) | 45,445,425 (1.256) | 335,248,951 (1.692) | 6,192,235 (3.314) | 101,014,757 (2.374) | 108,556,986 (2.433) | 135,787,277 (3.147) |

Observations:
- lang-uk/malyuk: ~1.6x improvement for Ukrainian, i.e. faster inference/training and a larger effective context window
- allenai/c4 (en): English unchanged
- allenai/c4 (es,fr,it,de): EU languages unchanged
- QIRIM/crh: slightly worse
- allenai/c4 (ru): Russian drops (UA-centric replacement)
- allenai/c4 (bg): Bulgarian drops slightly
- allenai/c4 (be): Belarusian drops
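The tokens-per-word numbers in the table are simply total token counts divided by corpus word counts. A quick check on the lang-uk/malyuk column, using the figures from the table:

```python
def fertility(n_tokens: int, n_words: int) -> float:
    """Average number of tokens the tokenizer emits per word (lower is better)."""
    return n_tokens / n_words


# lang-uk/malyuk column, numbers taken from the table above
ours = fertility(37_679_507, 22_898_164)  # DOROFEEVA-gpt-oss
base = fertility(59_447_036, 22_898_164)  # openai/gpt-oss-20b

print(round(ours, 3))  # 1.646
print(round(base, 3))  # 2.596
print(round(base / ours, 2))  # ~1.58x fewer tokens for Ukrainian text
```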
## Usage Example
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "path/to/gpt_oss_ukrainian_with_lapa_v3"
)
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids)  # Only 4 tokens! (1 per word)
```
## Model Contents
- `tokenizer.json` - byte-level tokenizer spec (199,998 tokens, 418,423 merges)
- `tokenizer_config.json` - configuration metadata
- `special_tokens_map.json` - special token mappings (identical to GPT-OSS)
- `merge_info.json` - information about removed and added tokens
- `README.md` - this file
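The vocab and merge counts above can be verified straight from `tokenizer.json`, assuming it follows the standard Hugging Face `tokenizers` JSON layout (a `model.vocab` map and a `model.merges` list):

```python
import json


def tokenizer_stats(path: str) -> tuple[int, int]:
    """Return (vocab size, merge count) from a HF tokenizers JSON file."""
    with open(path, encoding="utf-8") as f:
        spec = json.load(f)
    model = spec["model"]
    return len(model["vocab"]), len(model["merges"])


# vocab, merges = tokenizer_stats("tokenizer.json")
# Expected for this repo: vocab == 199_998, merges == 418_423
```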
## Embedding Initialization
For the newly added tokens, you can:
- use embedding-transfer tools such as FOCUS or ZeTT
- initialize embeddings randomly and train with a warm-up schedule

Unchanged tokens retain their original IDs, so their existing embeddings can be reused as-is.
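A minimal NumPy sketch of the random-initialization option (function name and scaling heuristic are illustrative, not part of this repo). Because the vocab size is unchanged, new Ukrainian tokens reuse the IDs of removed tokens, so only those rows are redrawn; sampling with the standard deviation measured on the untouched rows keeps the new embeddings on a comparable scale before warm-up training:

```python
import numpy as np


def reinit_new_token_rows(emb: np.ndarray, new_ids: list[int], seed: int = 0) -> np.ndarray:
    """Re-initialize embedding rows at the IDs taken over by new tokens.

    Rows not listed in new_ids are left untouched; redrawn rows are sampled
    from N(0, std) with std measured on the untouched rows.
    """
    out = emb.copy()
    keep = np.setdiff1d(np.arange(emb.shape[0]), new_ids)
    std = emb[keep].std()
    rng = np.random.default_rng(seed)
    out[new_ids] = rng.normal(0.0, std, size=(len(new_ids), emb.shape[1]))
    return out


# Toy example: a 10-token, 8-dim table where IDs 2 and 5 were replaced
emb = np.random.default_rng(1).normal(0.0, 0.02, size=(10, 8))
print(reinit_new_token_rows(emb, [2, 5]).shape)  # (10, 8)
```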
## Citation
```bibtex
@misc{zaduha2026post9792,
  author       = "{Bohdan Didenko}",
  title        = "{Post \#9792 on Telegram Channel Zaduha}",
  howpublished = "\url{https://t.me/zaduha/9792}",
  month        = jan,
  year         = {2026},
  note         = "[Online; accessed 31 January 2026]"
}
```
## Base Models
- Base: openai/gpt-oss-20b
- Donor: lapa-llm/tokenizer