iitb-t5-finetuned-punctuation

This model is a fine-tuned version of google-t5/t5-base on the English Punctuation Restoration dataset. It was developed as part of the research presented in the paper "Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation".

Model description

This model is designed to restore missing punctuation in English sentences. It serves as a critical component in a "restore-then-translate" pipeline, where it resolves semantic and structural ambiguities in source English text before the text is translated into Marathi. By inserting appropriate punctuation marks, such as commas and periods, it improves the reliability and meaning preservation of downstream translation.

Intended uses & limitations

The model is intended for punctuation restoration in English text, particularly as a pre-processing step for low-resource machine translation pipelines. It focuses on resolving ambiguities where the absence of punctuation changes a sentence's interpretation: for example, "let's eat grandma" and "let's eat, grandma" read very differently without the comma.

Model Usage

from transformers import pipeline

# Initialize the pipeline for text-to-text generation
punctuator_pipeline = pipeline("text2text-generation", model="thenlpresearcher/iitb-t5-finetuned-punctuation")

# Unpunctuated input sentence
text = "the morning sky stretched over the city like a quiet sheet of pale blue while people hurried through the streets"

# Run the text through the pipeline
output = punctuator_pipeline(text, max_length=128)

print(output)
# Sample Output: [{'generated_text': 'the morning sky stretched over the city like a quiet sheet of pale blue while people hurried through the streets.'}]

Training and evaluation data

The model was trained on the english_punctuation_restoration dataset. Evaluation was conducted using the Virām benchmark, a manually curated set of 54 punctuation-ambiguous English instances designed to test the robustness of translation systems.
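The training pairs presumably map unpunctuated, lowercased inputs to their punctuated originals. A minimal sketch of how such an input might be derived from a punctuated target sentence (the helper name is illustrative, not from the original training script):

```python
import string

def make_unpunctuated(text: str) -> str:
    """Strip punctuation and lowercase a sentence to produce a model input."""
    stripped = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.lower().split())

target = "The morning sky stretched over the city, like a quiet sheet of pale blue."
source = make_unpunctuated(target)
print(source)
# the morning sky stretched over the city like a quiet sheet of pale blue
```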

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 32
  • eval_batch_size: 64
  • seed: 42
  • optimizer: adamw_torch with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 3
  • mixed_precision_training: Native AMP
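For reference, the hyperparameters above map onto a Seq2SeqTrainingArguments configuration roughly as follows. This is a sketch, not the exact training script; output_dir is a placeholder, and the fp16 flag stands in for "Native AMP":

```python
from transformers import Seq2SeqTrainingArguments

# Configuration implied by the hyperparameters above (placeholder output_dir)
training_args = Seq2SeqTrainingArguments(
    output_dir="iitb-t5-finetuned-punctuation",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=3,
    fp16=True,  # Native AMP mixed precision
)
```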

Training results

Training Loss   Epoch   Step    Validation Loss   BLEU
0.0988          1.0      6441   0.0947            52.8823
0.0879          2.0     12882   0.0910            52.9691
0.0832          3.0     19323   0.0897            53.0293
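The BLEU column was presumably computed with standard tooling such as sacrebleu. For intuition, an unsmoothed sentence-level BLEU can be sketched in a few lines (a simplification; real evaluations use corpus-level BLEU with tokenization and smoothing):

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    # Count all n-grams of length n in a token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU on a 0-100 scale."""
    cand, ref = candidate.split(), reference.split()
    if len(cand) < max_n:
        return 0.0  # too short for 4-gram precision without smoothing
    precisions = []
    for n in range(1, max_n + 1):
        # Clipped n-gram overlap via Counter intersection (takes min counts)
        overlap = sum((ngram_counts(cand, n) & ngram_counts(ref, n)).values())
        if overlap == 0:
            return 0.0
        precisions.append(overlap / (len(cand) - n + 1))
    # Brevity penalty discourages overly short candidates
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * bp * math.exp(sum(map(math.log, precisions)) / max_n)

print(round(sentence_bleu("the sky is pale blue", "the sky is pale blue"), 1))
# 100.0
```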

Framework versions

  • Transformers 4.50.0
  • Pytorch 2.5.1+cu121
  • Datasets 2.21.0
  • Tokenizers 0.21.4