# iitb-t5-finetuned-punctuation
This model is a fine-tuned version of google-t5/t5-base on the English Punctuation Restoration dataset. It was developed as part of the research presented in the paper "Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation".
## Model description
This model is designed to restore missing punctuation in English sentences. It serves as a critical component in a "restore-then-translate" pipeline, where it resolves semantic and structural ambiguities in source English text before it is translated into Marathi. By adding appropriate marks like commas and periods, it improves the overall reliability and meaning preservation of downstream translation tasks.
- Paper: Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation
- Repository: Viram_Marathi GitHub
- Demo: Punctuation Robust English-to-Marathi Translation
## Intended uses & limitations
The model is intended for punctuation restoration in English text, particularly as a pre-processing step for low-resource machine translation pipelines. It focuses on resolving ambiguities where the absence of punctuation (like commas) changes the sentence's interpretation.
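The "restore-then-translate" composition described above can be sketched as a small helper. This is only an illustration: `punctuate` and `translate` are placeholder callables, since the paper's actual translation system is not specified here; in practice `punctuate` could wrap this model's pipeline and `translate` any English-Marathi MT model.

```python
from typing import Callable


def restore_then_translate(text: str,
                           punctuate: Callable[[str], str],
                           translate: Callable[[str], str]) -> str:
    """Restore punctuation first, then translate the disambiguated text.

    Running restoration before translation resolves ambiguities (e.g. a
    missing comma) that would otherwise propagate into the target text.
    """
    return translate(punctuate(text))
```

For example, `punctuate` might be `lambda t: punctuator_pipeline(t, max_length=128)[0]["generated_text"]`, with `translate` provided by a separate English-Marathi pipeline.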
## Model Usage
```python
from transformers import pipeline

# Initialize the pipeline for text-to-text generation
punctuator_pipeline = pipeline(
    "text2text-generation",
    model="thenlpresearcher/iitb-t5-finetuned-punctuation",
)

text = "the morning sky stretched over the city like a quiet sheet of pale blue while people hurried through the streets"

# Run the text through the pipeline
output = punctuator_pipeline(text, max_length=128)
print(output)
# Sample Output: [{'generated_text': 'the morning sky stretched over the city like a quiet sheet of pale blue while people hurried through the streets.'}]
```
## Training and evaluation data
The model was trained on the english_punctuation_restoration dataset. Evaluation was conducted using the Virām benchmark, a manually curated set of 54 punctuation-ambiguous English instances designed to test the robustness of translation systems.
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 64
- seed: 42
- optimizer: adamw_torch with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3
- mixed_precision_training: Native AMP
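For reference, the hyperparameters above would map onto `transformers` training arguments roughly as follows. This is a config sketch, not the authors' actual training script; `output_dir` is a placeholder.

```python
from transformers import Seq2SeqTrainingArguments

# Config sketch mirroring the listed hyperparameters (output_dir is a placeholder)
training_args = Seq2SeqTrainingArguments(
    output_dir="iitb-t5-finetuned-punctuation",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    seed=42,
    optim="adamw_torch",          # AdamW with betas=(0.9, 0.999), epsilon=1e-08
    lr_scheduler_type="linear",
    num_train_epochs=3,
    fp16=True,                    # native AMP mixed-precision training
)
```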
### Training results
| Training Loss | Epoch | Step | Validation Loss | Bleu |
|---|---|---|---|---|
| 0.0988 | 1.0 | 6441 | 0.0947 | 52.8823 |
| 0.0879 | 2.0 | 12882 | 0.0910 | 52.9691 |
| 0.0832 | 3.0 | 19323 | 0.0897 | 53.0293 |
### Framework versions
- Transformers 4.50.0
- Pytorch 2.5.1+cu121
- Datasets 2.21.0
- Tokenizers 0.21.4