iitb-t5-finetuned-punctuation

This model is a fine-tuned version of google-t5/t5-base on the English Punctuation Restoration dataset. It was developed as part of the research presented in the paper "Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation".

Model description

This model is designed to restore missing punctuation in English sentences. It serves as a critical component in a "restore-then-translate" pipeline, where it resolves semantic and structural ambiguities in source English text before the text is translated into Marathi. By inserting appropriate punctuation marks, such as commas and periods, it improves the reliability and meaning preservation of downstream translation.

Intended uses & limitations

The model is intended for punctuation restoration in English text, particularly as a pre-processing step for low-resource machine translation pipelines. It focuses on resolving ambiguities where the absence of punctuation changes a sentence's interpretation: for example, "let's eat grandma" and "let's eat, grandma" read very differently without the comma.

Model Usage

from transformers import pipeline

# Initialize the pipeline for text-to-text generation
punctuator_pipeline = pipeline("text2text-generation", model="thenlpresearcher/iitb-t5-finetuned-punctuation")

# Unpunctuated input sentence
text = "the morning sky stretched over the city like a quiet sheet of pale blue while people hurried through the streets"

# Run the text through the pipeline
output = punctuator_pipeline(text, max_length=128)

print(output)
# Sample Output: [{'generated_text': 'the morning sky stretched over the city like a quiet sheet of pale blue while people hurried through the streets.'}]

Training and evaluation data

The model was trained on the english_punctuation_restoration dataset. Evaluation was conducted using the Virām benchmark, a manually curated set of 54 punctuation-ambiguous English instances designed to test the robustness of translation systems.
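The training pairs presumably map unpunctuated, lowercased inputs to their punctuated originals. A minimal sketch of how such an input might be derived from a punctuated target sentence (the helper name is illustrative, not from the original training script):

```python
import string

def make_unpunctuated(text: str) -> str:
    """Strip punctuation and lowercase a sentence to produce a model input."""
    stripped = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.lower().split())

target = "The morning sky stretched over the city, like a quiet sheet of pale blue."
source = make_unpunctuated(target)
print(source)
# the morning sky stretched over the city like a quiet sheet of pale blue
```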

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 32
  • eval_batch_size: 64
  • seed: 42
  • optimizer: adamw_torch with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 3
  • mixed_precision_training: Native AMP
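For reference, the hyperparameters above map onto a Seq2SeqTrainingArguments configuration roughly as follows. This is a sketch, not the exact training script; output_dir is a placeholder, and the fp16 flag stands in for "Native AMP":

```python
from transformers import Seq2SeqTrainingArguments

# Configuration implied by the hyperparameters above (placeholder output_dir)
training_args = Seq2SeqTrainingArguments(
    output_dir="iitb-t5-finetuned-punctuation",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=3,
    fp16=True,  # Native AMP mixed precision
)
```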

Training results

Training Loss   Epoch   Step    Validation Loss   BLEU
0.0988          1.0      6441   0.0947            52.8823
0.0879          2.0     12882   0.0910            52.9691
0.0832          3.0     19323   0.0897            53.0293
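The BLEU column was presumably computed with standard tooling such as sacrebleu. For intuition, an unsmoothed sentence-level BLEU can be sketched in a few lines (a simplification; real evaluations use corpus-level BLEU with tokenization and smoothing):

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    # Count all n-grams of length n in a token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU on a 0-100 scale."""
    cand, ref = candidate.split(), reference.split()
    if len(cand) < max_n:
        return 0.0  # too short for 4-gram precision without smoothing
    precisions = []
    for n in range(1, max_n + 1):
        # Clipped n-gram overlap via Counter intersection (takes min counts)
        overlap = sum((ngram_counts(cand, n) & ngram_counts(ref, n)).values())
        if overlap == 0:
            return 0.0
        precisions.append(overlap / (len(cand) - n + 1))
    # Brevity penalty discourages overly short candidates
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * bp * math.exp(sum(map(math.log, precisions)) / max_n)

print(round(sentence_bleu("the sky is pale blue", "the sky is pale blue"), 1))
# 100.0
```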

Framework versions

  • Transformers 4.50.0
  • Pytorch 2.5.1+cu121
  • Datasets 2.21.0
  • Tokenizers 0.21.4