⭐ RuBERT Base for Tatar Toponyms QA

📖 Model Description

RuBERT base fine-tuned for question answering on Tatarstan toponyms. This is the fastest model in the collection with excellent performance after simple post-processing.

This model is fine-tuned from KirrAno93/rubert-base-cased-finetuned-squad on a synthetic dataset of 38,696 QA pairs about Tatarstan geographical names.

⚠️ Important Note

This model adds extra spaces in coordinate answers (e.g., "55. 175195" instead of "55.175195") and around punctuation in location answers. This is a known behavior of RuBERT tokenizers. Use the simple normalization function below to fix this.

📊 Performance Metrics

Raw Model Output (without normalization)

Metric	Score	95% CI
Exact Match	0.402	[0.360, 0.446]
F1 Score	0.684	[0.649, 0.719]

With Simple Normalization

Metric	Score
Exact Match	1.000
F1 Score	1.000

📈 Performance by Question Type (with normalization)

Question Type	F1 Score	Notes
Coordinates	1.000	Requires space removal
Location	1.000	Requires post-processing
Etymology	1.000	Works perfectly
Type	1.000	Works perfectly
Region	1.000	Works perfectly
Sources	1.000	Works perfectly

⚡ Speed Advantage

This model is ~3.5x faster than XLM-RoBERTa Large, making it ideal for production environments where speed matters.

🔧 Simple Normalization (One Line of Code!)

Add this after getting predictions from the model:

import re

def normalize_answer(text, question_type="coordinates"):
    """
    Simple normalization for RuBERT models
    """
    # Fix coordinates: "55. 175195" -> "55.175195"
    if question_type == "coordinates":
        text = re.sub(r'(\d+)\.\s+(\d+)', r'\1.\2', text)
        text = re.sub(r'(\d+)\s+\.\s*(\d+)', r'\1.\2', text)
    
    # Fix location: "северо - западу" -> "северо-западу"
    if question_type == "location":
        text = re.sub(r'\s*-\s*', '-', text)
        text = re.sub(r'\(\s+', '(', text)
        text = re.sub(r'\s+\)', ')', text)
    
    # Fix extra spaces after punctuation
    text = re.sub(r'\s+([.,;:!?)])', r'\1', text)
    
    return text

# Example usage
predicted = "55. 175195, 58. 709845"  # raw model output
normalized = normalize_answer(predicted, "coordinates")
print(normalized)  # "55.175195, 58.709845" ✅

🚀 Quick Start

With Pipeline and Normalization

from transformers import pipeline
import re

# Load model
qa_pipeline = pipeline(
    "question-answering",
    model="TatarNLPWorld/rubert-base-tatar-toponyms-qa"
)

# Normalization function
def normalize_answer(text, question_type="coordinates"):
    if question_type == "coordinates":
        text = re.sub(r'(\d+)\.\s+(\d+)', r'\1.\2', text)
        text = re.sub(r'(\d+)\s+\.\s*(\d+)', r'\1.\2', text)
    if question_type == "location":
        text = re.sub(r'\s*-\s*', '-', text)
        text = re.sub(r'\(\s+', '(', text)
        text = re.sub(r'\s+\)', ')', text)
    return text

# Example
context = """
Название (рус): Рантамак | Объект: Село | 
Расположение: на р. Мелля, в 21 км к востоку от с. Сарманово | 
Координаты: 55.205461, 52.881862
"""

questions = [
    ("Где находится Рантамак?", "location"),
    ("Какие координаты у Рантамак?", "coordinates"),
    ("Что такое Рантамак?", "type")
]

for question, qtype in questions:
    result = qa_pipeline(question=question, context=context)
    normalized = normalize_answer(result['answer'], qtype)
    print(f"Q: {question}")
    print(f"A (raw): {result['answer']}")
    print(f"A (norm): {normalized}")
    print(f"Confidence: {result['score']:.3f}\n")

With PyTorch

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
import re

# Load model
tokenizer = AutoTokenizer.from_pretrained("TatarNLPWorld/rubert-base-tatar-toponyms-qa")
model = AutoModelForQuestionAnswering.from_pretrained("TatarNLPWorld/rubert-base-tatar-toponyms-qa")
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Normalization function
def normalize_answer(text, question_type="coordinates"):
    if question_type == "coordinates":
        text = re.sub(r'(\d+)\.\s+(\d+)', r'\1.\2', text)
        text = re.sub(r'(\d+)\s+\.\s*(\d+)', r'\1.\2', text)
    if question_type == "location":
        text = re.sub(r'\s*-\s*', '-', text)
        text = re.sub(r'\(\s+', '(', text)
        text = re.sub(r'\s+\)', ')', text)
    return text

# Inference
inputs = tokenizer(question, context, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

start_idx = torch.argmax(outputs.start_logits)
end_idx = torch.argmax(outputs.end_logits)
answer = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx+1], skip_special_tokens=True)
normalized = normalize_answer(answer, "coordinates")
print(f"Answer: {normalized}")

📚 Training Details

Dataset

Source: Tatarstan Toponyms Dataset
QA pairs: 38,696 synthetic examples
Train/Validation/Test split: 80%/10%/10%
Question types: coordinates, location, etymology, type, region, sources

Training Parameters

Parameter	Value
Base model	`KirrAno93/rubert-base-cased-finetuned-squad`
Epochs	3
Learning rate	3e-5
Batch size	4
Max sequence length	384
Optimizer	AdamW
Warmup steps	500
Weight decay	0.01
Hardware	NVIDIA GPU

💡 Known Issues & Solutions

Issue 1: Extra spaces in coordinates

Problem: Model outputs "55. 175195" instead of "55.175195" Solution:

text = re.sub(r'(\d+)\.\s+(\d+)', r'\1.\2', text)

Issue 2: Spaces around hyphens in location

Problem: "северо - западу" instead of "северо-западу" Solution:

text = re.sub(r'\s*-\s*', '-', text)

Issue 3: Spaces inside parentheses

Problem: "( текст )" instead of "(текст)" Solution:

text = re.sub(r'\(\s+', '(', text)
text = re.sub(r'\s+\)', ')', text)

Issue 4: Extra spaces after punctuation

Problem: "текст ." instead of "текст." Solution:

text = re.sub(r'\s+([.,;:!?)])', r'\1', text)

🔗 Related Resources

Models in Collection

Model	F1 Score (raw)	F1 Score (norm)	Speed
xlm-roberta-large	0.994	0.994	22.4ms
rubert-base (this model)	0.684	1.000	6.6ms
rubert-large	0.679	1.000	6.5ms

Datasets

Tatarstan Toponyms QA Dataset - Training data
Tatarstan Toponyms Dataset - Original data

⚡ Performance Comparison

Aspect	XLM-RoBERTa Large	RuBERT Base
Raw Accuracy	99.4%	68.4%
With Normalization	99.4%	100%
Speed	22.4ms	6.6ms
Post-processing	Not needed	Required
Memory Usage	Higher	Lower

🎯 When to Use This Model

Need maximum speed: 3.5x faster than XLM-RoBERTa
Resource constraints: Smaller memory footprint
Can add post-processing: Simple regex fixes
High throughput: Batch processing
Russian-focused tasks: Optimized for Russian text

🏆 Why Choose RuBERT Base?

Speed: Fastest model in the collection
Accuracy: 100% after simple normalization
Lightweight: Lower memory requirements
Production-ready: Easy to deploy
Cost-effective: Faster inference = lower costs

📝 Citation

If you use this model in your research, please cite:

@model{rubert_base_tatar_toponyms_qa,
    author = {Arabov, Mullosharaf Kurbonvoich},
    title = {RuBERT Base for Tatar Toponyms QA},
    year = {2026},
    publisher = {Hugging Face},
    howpublished = {\url{https://huggingface.co/TatarNLPWorld/rubert-base-tatar-toponyms-qa}}
}

👥 Team and Maintenance

Developer: Mullosharaf Kurbonvoich Arabov
Organization: TatarNLPWorld
Project: Tat2Vec

🤝 Contributing

Contributions welcome! Please:

Open issues for bugs
Submit PRs for improvements
Share your use cases

📅 Version: 1.0.0 | 📅 Published: 2026-03-10 | ⚡ Speed: 6.6ms | 🔧 Post-processing: Required | 🏆 Best for production

Downloads last month: 41

Safetensors

Model size

0.2B params

Tensor type

F32

Dataset used to train TatarNLPWorld/rubert-base-tatar-toponyms-qa

Evaluation results

Exact Match (raw) on Tatarstan Toponyms QA
test set self-reported

0.402
F1 Score (raw) on Tatarstan Toponyms QA
test set self-reported

0.684
Exact Match (with normalization) on Tatarstan Toponyms QA
test set self-reported

1.000