⭐ RuBERT Base for Tatar Toponyms QA
📖 Model Description
RuBERT base fine-tuned for question answering on Tatarstan toponyms. This is the fastest model in the collection with excellent performance after simple post-processing.
This model is fine-tuned from KirrAno93/rubert-base-cased-finetuned-squad on a synthetic dataset of 38,696 QA pairs about Tatarstan geographical names.
⚠️ Important Note
This model adds extra spaces in coordinate answers (e.g., "55. 175195" instead of "55.175195") and around punctuation in location answers. This is a known behavior of RuBERT tokenizers. Use the simple normalization function below to fix this.
📊 Performance Metrics
Raw Model Output (without normalization)
| Metric | Score | 95% CI |
|---|---|---|
| Exact Match | 0.402 | [0.360, 0.446] |
| F1 Score | 0.684 | [0.649, 0.719] |
With Simple Normalization
| Metric | Score |
|---|---|
| Exact Match | 1.000 |
| F1 Score | 1.000 |
📈 Performance by Question Type (with normalization)
| Question Type | F1 Score | Notes |
|---|---|---|
| Coordinates | 1.000 | Requires space removal |
| Location | 1.000 | Requires post-processing |
| Etymology | 1.000 | Works perfectly |
| Type | 1.000 | Works perfectly |
| Region | 1.000 | Works perfectly |
| Sources | 1.000 | Works perfectly |
⚡ Speed Advantage
This model is ~3.5x faster than XLM-RoBERTa Large, making it ideal for production environments where speed matters.
🔧 Simple Normalization (One Line of Code!)
Add this after getting predictions from the model:
import re
def normalize_answer(text, question_type="coordinates"):
"""
Simple normalization for RuBERT models
"""
# Fix coordinates: "55. 175195" -> "55.175195"
if question_type == "coordinates":
text = re.sub(r'(\d+)\.\s+(\d+)', r'\1.\2', text)
text = re.sub(r'(\d+)\s+\.\s*(\d+)', r'\1.\2', text)
# Fix location: "северо - западу" -> "северо-западу"
if question_type == "location":
text = re.sub(r'\s*-\s*', '-', text)
text = re.sub(r'\(\s+', '(', text)
text = re.sub(r'\s+\)', ')', text)
# Fix extra spaces after punctuation
text = re.sub(r'\s+([.,;:!?)])', r'\1', text)
return text
# Example usage
predicted = "55. 175195, 58. 709845" # raw model output
normalized = normalize_answer(predicted, "coordinates")
print(normalized) # "55.175195, 58.709845" ✅
🚀 Quick Start
With Pipeline and Normalization
from transformers import pipeline
import re
# Load model
qa_pipeline = pipeline(
"question-answering",
model="TatarNLPWorld/rubert-base-tatar-toponyms-qa"
)
# Normalization function
def normalize_answer(text, question_type="coordinates"):
if question_type == "coordinates":
text = re.sub(r'(\d+)\.\s+(\d+)', r'\1.\2', text)
text = re.sub(r'(\d+)\s+\.\s*(\d+)', r'\1.\2', text)
if question_type == "location":
text = re.sub(r'\s*-\s*', '-', text)
text = re.sub(r'\(\s+', '(', text)
text = re.sub(r'\s+\)', ')', text)
return text
# Example
context = """
Название (рус): Рантамак | Объект: Село |
Расположение: на р. Мелля, в 21 км к востоку от с. Сарманово |
Координаты: 55.205461, 52.881862
"""
questions = [
("Где находится Рантамак?", "location"),
("Какие координаты у Рантамак?", "coordinates"),
("Что такое Рантамак?", "type")
]
for question, qtype in questions:
result = qa_pipeline(question=question, context=context)
normalized = normalize_answer(result['answer'], qtype)
print(f"Q: {question}")
print(f"A (raw): {result['answer']}")
print(f"A (norm): {normalized}")
print(f"Confidence: {result['score']:.3f}\n")
With PyTorch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
import re
# Load model
tokenizer = AutoTokenizer.from_pretrained("TatarNLPWorld/rubert-base-tatar-toponyms-qa")
model = AutoModelForQuestionAnswering.from_pretrained("TatarNLPWorld/rubert-base-tatar-toponyms-qa")
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# Normalization function
def normalize_answer(text, question_type="coordinates"):
if question_type == "coordinates":
text = re.sub(r'(\d+)\.\s+(\d+)', r'\1.\2', text)
text = re.sub(r'(\d+)\s+\.\s*(\d+)', r'\1.\2', text)
if question_type == "location":
text = re.sub(r'\s*-\s*', '-', text)
text = re.sub(r'\(\s+', '(', text)
text = re.sub(r'\s+\)', ')', text)
return text
# Inference
inputs = tokenizer(question, context, return_tensors="pt").to(device)
with torch.no_grad():
outputs = model(**inputs)
start_idx = torch.argmax(outputs.start_logits)
end_idx = torch.argmax(outputs.end_logits)
answer = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx+1], skip_special_tokens=True)
normalized = normalize_answer(answer, "coordinates")
print(f"Answer: {normalized}")
📚 Training Details
Dataset
- Source: Tatarstan Toponyms Dataset
- QA pairs: 38,696 synthetic examples
- Train/Validation/Test split: 80%/10%/10%
- Question types: coordinates, location, etymology, type, region, sources
Training Parameters
| Parameter | Value |
|---|---|
| Base model | KirrAno93/rubert-base-cased-finetuned-squad |
| Epochs | 3 |
| Learning rate | 3e-5 |
| Batch size | 4 |
| Max sequence length | 384 |
| Optimizer | AdamW |
| Warmup steps | 500 |
| Weight decay | 0.01 |
| Hardware | NVIDIA GPU |
💡 Known Issues & Solutions
Issue 1: Extra spaces in coordinates
Problem: Model outputs "55. 175195" instead of "55.175195"
Solution:
text = re.sub(r'(\d+)\.\s+(\d+)', r'\1.\2', text)
Issue 2: Spaces around hyphens in location
Problem: "северо - западу" instead of "северо-западу"
Solution:
text = re.sub(r'\s*-\s*', '-', text)
Issue 3: Spaces inside parentheses
Problem: "( текст )" instead of "(текст)"
Solution:
text = re.sub(r'\(\s+', '(', text)
text = re.sub(r'\s+\)', ')', text)
Issue 4: Extra spaces after punctuation
Problem: "текст ." instead of "текст."
Solution:
text = re.sub(r'\s+([.,;:!?)])', r'\1', text)
🔗 Related Resources
Models in Collection
| Model | F1 Score (raw) | F1 Score (norm) | Speed |
|---|---|---|---|
| xlm-roberta-large | 0.994 | 0.994 | 22.4ms |
| rubert-base (this model) | 0.684 | 1.000 | 6.6ms |
| rubert-large | 0.679 | 1.000 | 6.5ms |
Datasets
- Tatarstan Toponyms QA Dataset - Training data
- Tatarstan Toponyms Dataset - Original data
⚡ Performance Comparison
| Aspect | XLM-RoBERTa Large | RuBERT Base |
|---|---|---|
| Raw Accuracy | 99.4% | 68.4% |
| With Normalization | 99.4% | 100% |
| Speed | 22.4ms | 6.6ms |
| Post-processing | Not needed | Required |
| Memory Usage | Higher | Lower |
🎯 When to Use This Model
- Need maximum speed: 3.5x faster than XLM-RoBERTa
- Resource constraints: Smaller memory footprint
- Can add post-processing: Simple regex fixes
- High throughput: Batch processing
- Russian-focused tasks: Optimized for Russian text
🏆 Why Choose RuBERT Base?
- Speed: Fastest model in the collection
- Accuracy: 100% after simple normalization
- Lightweight: Lower memory requirements
- Production-ready: Easy to deploy
- Cost-effective: Faster inference = lower costs
📝 Citation
If you use this model in your research, please cite:
@model{rubert_base_tatar_toponyms_qa,
author = {Arabov, Mullosharaf Kurbonvoich},
title = {RuBERT Base for Tatar Toponyms QA},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/TatarNLPWorld/rubert-base-tatar-toponyms-qa}}
}
👥 Team and Maintenance
- Developer: Mullosharaf Kurbonvoich Arabov
- Organization: TatarNLPWorld
- Project: Tat2Vec
🤝 Contributing
Contributions welcome! Please:
- Open issues for bugs
- Submit PRs for improvements
- Share your use cases
📅 Version: 1.0.0 | 📅 Published: 2026-03-10 | ⚡ Speed: 6.6ms | 🔧 Post-processing: Required | 🏆 Best for production
- Downloads last month
- 41
Dataset used to train TatarNLPWorld/rubert-base-tatar-toponyms-qa
Evaluation results
- Exact Match (raw) on Tatarstan Toponyms QAtest set self-reported0.402
- F1 Score (raw) on Tatarstan Toponyms QAtest set self-reported0.684
- Exact Match (with normalization) on Tatarstan Toponyms QAtest set self-reported1.000