Thanks for doing this! I had to train some tokenizers with v4, and it was indeed not straightforward to understand the behavior.
I had two questions:
- You said that older model implementations may rely on Python-specific behavior. Curious if you have any example of that?
- You sometimes write "fast" in quotes. Is that just to refer to the fast tokenizers backend, or can a "fast" tokenizer actually be slower than the Python implementation because of some kind of Rust overhead?
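For reference, this is what I mean by the two backends (a minimal sketch, assuming the standard `AutoTokenizer` API and its `use_fast` flag; `bert-base-uncased` is just an example checkpoint that ships both implementations):

```python
from transformers import AutoTokenizer

# "Fast" tokenizer: backed by the Rust `tokenizers` library.
fast_tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# "Slow" tokenizer: the pure-Python implementation.
slow_tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)

print(fast_tok.is_fast)  # True
print(slow_tok.is_fast)  # False
```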

