# DOROFEEVA-gpt-oss-tokenizer
🎵 The name is inspired by a song.
## Overview
A specialized tokenizer based on openai/gpt-oss-20b, optimized for Ukrainian-language processing. It adds 45,333 Ukrainian tokens from lapa-llm/tokenizer while keeping the original vocabulary size of 199,998 by replacing tokens from writing systems geographically and culturally distant from Ukraine.
## Key Features
- **+45,333 new Cyrillic BPE tokens:**
  - 152 base Cyrillic letters
  - 45,181 Ukrainian word tokens from lapa-llm/tokenizer
- **No removal of English or EU language tokens:** only non-essential tokens from distant writing systems were replaced
- **Latin-safe removal:** Latin tokens with diacritics used across European languages (ü, ö, ç, ã, á, é, etc.) are preserved
- **Identical specifications:** the vocab size (199,998) and byte-level BPE encoding match the original GPT-OSS 20B
## Replaced Tokens by Writing System
The tokenizer replaced tokens from writing systems with low relevance to Ukrainian:
| Writing System | Tokens removed | Tokens retained |
|---|---|---|
| Arabic | 7,144 | 818 |
| Han (Chinese) | 5,323 | 1,422 |
| Devanagari (Hindi) | 3,100 | 808 |
| Hebrew | 1,865 | 462 |
| Bengali | 1,633 | 440 |
| Hangul (Korean) | 1,562 | 336 |
| Armenian | 1,366 | 307 |
| Malayalam | 1,293 | 318 |
| Gujarati | 1,255 | 315 |
| Thai | 1,213 | 296 |
| Telugu | 1,018 | 255 |
| Kannada | 1,016 | 252 |
| Tamil | 755 | 189 |
| Japanese (Hiragana/Katakana) | 515 | 167 |
| Sinhala | 212 | 54 |
| Gurmukhi | 209 | 65 |
| Khmer | 199 | 98 |
| Myanmar | 170 | 99 |
Fully preserved:
- Latin scripts (English + European languages)
- Greek (for math/science)
- Common punctuation and emojis
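The by-script selection described above can be approximated with a small classifier over Unicode character names. A minimal sketch using Python's `unicodedata` (the actual selection logic used for this repo may differ):

```python
import unicodedata


def dominant_script(token: str) -> str:
    """Guess the writing system of a token from Unicode character names.

    Returns e.g. "CYRILLIC", "LATIN", or "ARABIC"; returns "OTHER" for
    tokens with no alphabetic characters (punctuation, digits, raw bytes).
    """
    counts: dict[str, int] = {}
    for ch in token:
        if not ch.isalpha():
            continue
        # Unicode names start with the script, e.g. "CYRILLIC SMALL LETTER YA".
        name = unicodedata.name(ch, "")
        script = name.split(" ")[0] if name else "OTHER"
        counts[script] = counts.get(script, 0) + 1
    if not counts:
        return "OTHER"
    return max(counts, key=counts.get)


print(dominant_script("привіт"))  # CYRILLIC
print(dominant_script("hello"))   # LATIN
```

A classifier like this is enough to bucket vocabulary entries into the per-script counts shown in the table above.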
## Metrics
Acknowledgement: evaluation results provided by Andrii Sameliuk
Each cell shows total tokens with tokens-per-word (fertility) in parentheses; lower is better.

| Tokenizer | lang-uk/malyuk [100k] | allenai/c4 (en) [100k] | allenai/c4 (es,fr,it,de) [100k] | QIRIM/crh (Cyrillic) [94] | allenai/c4 (ru) [100k] | allenai/c4 (bg) [100k] | allenai/c4 (be) [100k] |
|---|---|---|---|---|---|---|---|
| *Words in corpus* | 22,898,164 | 36,170,971 | 198,173,216 | 1,868,259 | 42,557,519 | 44,627,199 | 43,153,645 |
| Qwen/Qwen3-8B | 84,408,084 (3.686) | 46,884,593 (1.296) | 395,581,536 (1.996) | 7,956,741 (4.259) | 116,115,062 (2.728) | 132,597,427 (2.971) | 173,571,099 (4.022) |
| meta-llama/Llama-3.1-8B-Instruct | 57,226,997 (2.499) | 46,085,724 (1.274) | 382,143,751 (1.928) | 7,386,873 (3.954) | 104,974,733 (2.467) | 119,123,733 (2.669) | 150,189,294 (3.480) |
| microsoft/Phi-4-mini-instruct | 59,447,036 (2.596) | 45,423,925 (1.256) | 335,188,687 (1.691) | 5,995,822 (3.209) | 91,824,464 (2.158) | 102,472,523 (2.296) | 119,587,038 (2.771) |
| CohereLabs/aya-expanse-8b | 50,973,632 (2.226) | 47,364,187 (1.309) | 353,221,932 (1.782) | 6,614,719 (3.541) | 93,089,697 (2.187) | 112,612,668 (2.523) | 141,262,943 (3.273) |
| google/gemma-3-12b-it | 57,388,402 (2.506) | 47,285,432 (1.307) | 354,241,840 (1.788) | 6,240,944 (3.341) | 95,520,817 (2.245) | 103,950,626 (2.329) | 131,398,147 (3.045) |
| openai/gpt-oss-20b | 59,447,036 (2.596) | 45,423,925 (1.256) | 335,188,687 (1.691) | 5,995,822 (3.209) | 91,824,464 (2.158) | 102,472,523 (2.296) | 119,587,038 (2.771) |
| DOROFEEVA-gpt-oss (Ours) | 37,679,507 (1.646 🤩) | 45,445,425 (1.256) | 335,248,951 (1.692) | 6,192,235 (3.314) | 101,014,757 (2.374) | 108,556,986 (2.433) | 135,787,277 (3.147) |

Observations:
- lang-uk/malyuk: ~1.6x improvement for Ukrainian, i.e. faster inference/training and a larger effective context window
- allenai/c4 (en): English unchanged
- allenai/c4 (es,fr,it,de): EU languages unchanged
- QIRIM/crh: slightly worse
- allenai/c4 (ru): Russian drops (UA-centric replacement)
- allenai/c4 (bg): Bulgarian drops slightly
- allenai/c4 (be): Belarusian drops
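The tokens-per-word numbers in the table are simply total token counts divided by corpus word counts. A quick check on the lang-uk/malyuk column, using the figures from the table:

```python
def fertility(n_tokens: int, n_words: int) -> float:
    """Average number of tokens the tokenizer emits per word (lower is better)."""
    return n_tokens / n_words


# lang-uk/malyuk column, numbers taken from the table above
ours = fertility(37_679_507, 22_898_164)  # DOROFEEVA-gpt-oss
base = fertility(59_447_036, 22_898_164)  # openai/gpt-oss-20b

print(round(ours, 3))  # 1.646
print(round(base, 3))  # 2.596
print(round(base / ours, 2))  # ~1.58x fewer tokens for Ukrainian text
```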
## Usage Example
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "path/to/gpt_oss_ukrainian_with_lapa_v3"
)
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids)  # Only 4 tokens! (1 per word)
```
## Model Contents
- `tokenizer.json` - byte-level tokenizer spec (199,998 tokens, 418,423 merges)
- `tokenizer_config.json` - configuration metadata
- `special_tokens_map.json` - special token mappings (identical to GPT-OSS)
- `merge_info.json` - information about removed and added tokens
- `README.md` - this file
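The vocab and merge counts above can be verified straight from `tokenizer.json`, assuming it follows the standard Hugging Face `tokenizers` JSON layout (a `model.vocab` map and a `model.merges` list):

```python
import json


def tokenizer_stats(path: str) -> tuple[int, int]:
    """Return (vocab size, merge count) from a HF tokenizers JSON file."""
    with open(path, encoding="utf-8") as f:
        spec = json.load(f)
    model = spec["model"]
    return len(model["vocab"]), len(model["merges"])


# vocab, merges = tokenizer_stats("tokenizer.json")
# Expected for this repo: vocab == 199_998, merges == 418_423
```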
## Embedding Initialization
For the newly added tokens, you can:
- use embedding-transfer tools such as FOCUS or ZeTT
- initialize embeddings randomly and train with a warm-up schedule

Unchanged tokens retain their original IDs, so their existing embeddings can be reused as-is.
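A minimal NumPy sketch of the random-initialization option (function name and scaling heuristic are illustrative, not part of this repo). Because the vocab size is unchanged, new Ukrainian tokens reuse the IDs of removed tokens, so only those rows are redrawn; sampling with the standard deviation measured on the untouched rows keeps the new embeddings on a comparable scale before warm-up training:

```python
import numpy as np


def reinit_new_token_rows(emb: np.ndarray, new_ids: list[int], seed: int = 0) -> np.ndarray:
    """Re-initialize embedding rows at the IDs taken over by new tokens.

    Rows not listed in new_ids are left untouched; redrawn rows are sampled
    from N(0, std) with std measured on the untouched rows.
    """
    out = emb.copy()
    keep = np.setdiff1d(np.arange(emb.shape[0]), new_ids)
    std = emb[keep].std()
    rng = np.random.default_rng(seed)
    out[new_ids] = rng.normal(0.0, std, size=(len(new_ids), emb.shape[1]))
    return out


# Toy example: a 10-token, 8-dim table where IDs 2 and 5 were replaced
emb = np.random.default_rng(1).normal(0.0, 0.02, size=(10, 8))
print(reinit_new_token_rows(emb, [2, 5]).shape)  # (10, 8)
```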
## Citation
```bibtex
@misc{zaduha2026post9792,
  author       = "{Bohdan Didenko}",
  title        = "{Post \#9792 on Telegram Channel Zaduha}",
  howpublished = "\url{https://t.me/zaduha/9792}",
  month        = jan,
  year         = {2026},
  note         = "[Online; accessed 31 January 2026]"
}
```
## Base Models
- Base: openai/gpt-oss-20b
- Donor: lapa-llm/tokenizer