DOROFEEVA-gpt-oss-tokenizer

🎵 The name is inspired by a song.

Overview

A specialized tokenizer based on openai/gpt-oss-20b, optimized for Ukrainian language processing. It adds 45,333 Ukrainian tokens from lapa-llm/tokenizer while keeping the original vocabulary size of 199,998 by replacing tokens from writing systems geographically and culturally distant from Ukraine.

Key Features

  1. +45,333 new Cyrillic BPE tokens

  2. No removal of English or EU language tokens - only non-essential tokens from distant writing systems were replaced

  3. Latin-safe removal - preserved Latin tokens with diacritics used in multiple European languages (ü, ö, ç, ã, á, é, etc.)

  4. Identical specifications - vocab size (199,998) and byte-level BPE encoding match the original GPT-OSS 20B

Replaced Tokens by Writing System

The tokenizer replaced tokens from writing systems with low relevance to Ukrainian:

| Writing system | Tokens removed | Tokens retained |
|---|---:|---:|
| Arabic | 7,144 | 818 |
| Han (Chinese) | 5,323 | 1,422 |
| Devanagari (Hindi) | 3,100 | 808 |
| Hebrew | 1,865 | 462 |
| Bengali | 1,633 | 440 |
| Hangul (Korean) | 1,562 | 336 |
| Armenian | 1,366 | 307 |
| Malayalam | 1,293 | 318 |
| Gujarati | 1,255 | 315 |
| Thai | 1,213 | 296 |
| Telugu | 1,018 | 255 |
| Kannada | 1,016 | 252 |
| Tamil | 755 | 189 |
| Japanese (Hiragana/Katakana) | 515 | 167 |
| Sinhala | 212 | 54 |
| Gurmukhi | 209 | 65 |
| Khmer | 199 | 98 |
| Myanmar | 170 | 99 |

Fully preserved:

  • Latin scripts (English + European languages)
  • Greek (for math/science)
  • Common punctuation and emojis
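The per-script split above can be reproduced with a simple classifier over Unicode character names. The sketch below is an assumption about the methodology (not the actual selection code) and uses only the stdlib `unicodedata` module:

```python
import unicodedata

# Script keywords as they appear in Unicode character names; covers both the
# replaced scripts and the preserved ones (Cyrillic, Latin, Greek).
SCRIPTS = ["ARABIC", "CJK", "DEVANAGARI", "HEBREW", "BENGALI", "HANGUL",
           "ARMENIAN", "MALAYALAM", "GUJARATI", "THAI", "TELUGU", "KANNADA",
           "TAMIL", "HIRAGANA", "KATAKANA", "SINHALA", "GURMUKHI", "KHMER",
           "MYANMAR", "CYRILLIC", "LATIN", "GREEK"]

def dominant_script(token: str) -> str:
    """Return the first listed script found among the token's characters."""
    for ch in token:
        try:
            words = unicodedata.name(ch).split()
        except ValueError:  # unnamed code point (e.g. some control chars)
            continue
        for script in SCRIPTS:
            if script in words:
                return script
    return "OTHER"

print(dominant_script("привіт"))  # CYRILLIC
print(dominant_script("ü"))      # LATIN -> kept, per the Latin-safe rule
```

Note that a Latin letter with diacritics still carries "LATIN" in its Unicode name, which is exactly why the Latin-safe rule is easy to enforce at this level.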

Metrics

Acknowledgement: evaluation results provided by Andrii Sameliuk

Each cell shows total tokens with tokens-per-word in parentheses (lower is better).

| Tokenizer | lang-uk/malyuk [100k] | allenai/c4 (en) [100k] | allenai/c4 (es,fr,it,de) [100k] | QIRIM/crh (Cyrillic) [94] | allenai/c4 (ru) [100k] | allenai/c4 (bg) [100k] | allenai/c4 (be) [100k] |
|---|---|---|---|---|---|---|---|
| Words | 22,898,164 | 36,170,971 | 198,173,216 | 1,868,259 | 42,557,519 | 44,627,199 | 43,153,645 |
| Qwen/Qwen3-8B | 84,408,084 (3.686) | 46,884,593 (1.296) | 395,581,536 (1.996) | 7,956,741 (4.259) | 116,115,062 (2.728) | 132,597,427 (2.971) | 173,571,099 (4.022) |
| meta-llama/Llama-3.1-8B-Instruct | 57,226,997 (2.499) | 46,085,724 (1.274) | 382,143,751 (1.928) | 7,386,873 (3.954) | 104,974,733 (2.467) | 119,123,733 (2.669) | 150,189,294 (3.480) |
| microsoft/Phi-4-mini-instruct | 59,447,036 (2.596) | 45,423,925 (1.256) | 335,188,687 (1.691) | 5,995,822 (3.209) | 91,824,464 (2.158) | 102,472,523 (2.296) | 119,587,038 (2.771) |
| CohereLabs/aya-expanse-8b | 50,973,632 (2.226) | 47,364,187 (1.309) | 353,221,932 (1.782) | 6,614,719 (3.541) | 93,089,697 (2.187) | 112,612,668 (2.523) | 141,262,943 (3.273) |
| google/gemma-3-12b-it | 57,388,402 (2.506) | 47,285,432 (1.307) | 354,241,840 (1.788) | 6,240,944 (3.341) | 95,520,817 (2.245) | 103,950,626 (2.329) | 131,398,147 (3.045) |
| openai/gpt-oss-20b | 59,447,036 (2.596) | 45,423,925 (1.256) | 335,188,687 (1.691) | 5,995,822 (3.209) | 91,824,464 (2.158) | 102,472,523 (2.296) | 119,587,038 (2.771) |
| DOROFEEVA-gpt-oss (Ours) | 37,679,507 (1.646) 🤩 | 45,445,425 (1.256) | 335,248,951 (1.692) | 6,192,235 (3.314) | 101,014,757 (2.374) | 108,556,986 (2.433) | 135,787,277 (3.147) |

Comments:

  • ~1.6x improvement for Ukrainian: faster inference/training and a larger effective context window
  • English unchanged
  • EU languages unchanged
  • QIRIM slightly worse
  • Russian drops (UA-centric)
  • Bulgarian drops slightly
  • Belarusian drops
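The tokens-per-word figures above are just total tokens divided by total words; for example, for the lang-uk/malyuk column:

```python
def tokens_per_word(tokens: int, words: int) -> float:
    """Average number of tokens emitted per word; lower means denser encoding."""
    return tokens / words

# lang-uk/malyuk column: 22,898,164 words total.
base = tokens_per_word(59_447_036, 22_898_164)  # openai/gpt-oss-20b
ours = tokens_per_word(37_679_507, 22_898_164)  # DOROFEEVA-gpt-oss

print(round(base, 3), round(ours, 3))  # 2.596 1.646
print(round(base / ours, 2))           # 1.58 -> ~1.6x fewer tokens for Ukrainian
```

Fewer tokens per word means the same Ukrainian text consumes less of the context window and fewer forward passes at inference time.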

Usage Example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "path/to/gpt_oss_ukrainian_with_lapa_v3"
)

toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids)  # Only 4 tokens! (1 per word)
```

Model Contents

  • tokenizer.json - Byte-level tokenizer spec (199,998 tokens, 418,423 merges)
  • tokenizer_config.json - Configuration metadata
  • special_tokens_map.json - Special token mappings (identical to GPT-OSS)
  • merge_info.json - Information about removed and added tokens
  • README.md - This file
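The counts listed above can be checked directly against tokenizer.json. The sketch below assumes the standard Hugging Face `tokenizers` JSON layout (a top-level `model` object with `vocab` and `merges` fields):

```python
import json

def tokenizer_stats(path: str) -> tuple[int, int]:
    """Return (vocab size, number of BPE merges) from a tokenizer.json file."""
    with open(path, encoding="utf-8") as f:
        spec = json.load(f)
    return len(spec["model"]["vocab"]), len(spec["model"]["merges"])

# For this repo's file the listing above says:
# tokenizer_stats("tokenizer.json") -> (199998, 418423)
```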

Embedding Initialization

For newly added tokens, you can:

  • Use embedding-transfer tools such as FOCUS or ZeTT
  • Initialize embeddings randomly with warm-up schedule training
  • Unchanged tokens retain original IDs and can reuse existing embeddings
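Because replaced tokens keep their original, scattered IDs (nothing is appended at the end), a random re-initialization should touch only those rows of the embedding matrix. A minimal NumPy sketch, where the ID list and the mean-plus-noise heuristic are assumptions (FOCUS/ZeTT implement more principled transfer):

```python
import numpy as np

def reinit_replaced_rows(weights: np.ndarray, new_ids: list[int],
                         noise: float = 0.02, seed: int = 0) -> np.ndarray:
    """Reset the rows for replaced token IDs to the mean of the kept rows
    plus small Gaussian noise; kept rows are returned untouched."""
    rng = np.random.default_rng(seed)
    out = weights.copy()
    keep = np.ones(len(weights), dtype=bool)
    keep[new_ids] = False  # mask out the rows being re-initialized
    mean = weights[keep].mean(axis=0)
    out[new_ids] = mean + noise * rng.standard_normal(
        (len(new_ids), weights.shape[1]))
    return out
```

After this, a warm-up phase that trains only the new rows (with the rest frozen) is a common way to stabilize early training.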

Citation

```bibtex
@misc{zaduha2026post9792,
  author       = "{Bohdan Didenko}",
  title        = "{Post \#9792 on Telegram Channel Zaduha}",
  howpublished = "\url{https://t.me/zaduha/9792}",
  month        = jan,
  year         = {2026},
  note         = "[Online; accessed 31 January 2026]"
}
```

Base Models

  • openai/gpt-oss-20b