
asr-whisper-largev2-v6

This model is a domain-adapted version of openchs/asr-whisper-helpline-sw-v1 fine-tuned on an expanded dataset of real phone call recordings from the Tanzania Child Helpline system, powered by OpenCHS.

Model Description

This ASR model represents the second iteration of domain-specific fine-tuning (v6) with a significantly expanded training corpus compared to v5. The model continues to bridge the gap between clean, read speech and real-world telephony audio while benefiting from ~3.7x more training data than its predecessor.

Key Characteristics:

  • Domain: Child helpline phone call transcription (Swahili)
  • Best Checkpoint: Step 4,500
  • Validation WER: 44.02% on real phone call audio
  • Validation Loss: 0.7552
  • Training Dataset: Swahili ASR v7 (~173.4 hours of augmented telephony speech)

Performance Context: The model achieves 44.02% WER on real telephony audio, representing a 1.7 percentage point improvement over v5 (45.74% WER) while being trained on substantially more data. This demonstrates effective scaling with increased domain-specific training examples.

Training Strategy

Three-Stage Training Pipeline:

  1. Stage 1 - Common Voice 17.0: Initial fine-tuning from Whisper Large v2 (10,000 steps)
  2. Stage 2 - Common Voice 23.0: Continued training on updated Common Voice data (7,500 steps → 23.56% WER)
  3. Stage 3 (This Model) - Real Phone Calls v7: Domain adaptation on expanded helpline recordings (4,500 steps → 44.02% WER on telephony)

This model represents Stage 3 v6 with enhanced domain-specific optimization through increased training data volume.
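
Since the stage-3 run continues from the stage-2 checkpoint rather than from vanilla Whisper Large v2, fine-tuning starts by loading that base. A minimal sketch of this starting point (the actual training script is not published):

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Stage 3 starts from the stage-2 helpline checkpoint, not vanilla Whisper Large v2
base_checkpoint = "openchs/asr-whisper-helpline-sw-v1"
processor = WhisperProcessor.from_pretrained(base_checkpoint)
model = WhisperForConditionalGeneration.from_pretrained(base_checkpoint)
```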

Model Improvements Over v5

Data Scale Comparison

| Version | Training Samples | Original Duration | Augmented Duration | Improvement |
|---------|------------------|-------------------|--------------------|-------------|
| v5 | 31,720 | ~9.3 hours | ~46.5 hours | Baseline |
| v6 | 113,545 | ~39.0 hours | ~173.4 hours | +3.58x training samples |

Performance Comparison

| Metric | v5 (4,500 steps) | v6 (4,500 steps) | Improvement |
|--------|------------------|------------------|-------------|
| Validation WER | 45.74% | 44.02% | -1.72pp |
| Validation Loss | 1.1333 | 0.7552 | -33.4% |
| Training Data (augmented) | ~46.5 hours | ~173.4 hours | +3.73x hours |

Key Improvements:

  • Better Generalization: 33% reduction in validation loss indicates significantly improved model confidence
  • Enhanced Accuracy: 1.7 percentage point WER improvement on telephony audio
  • More Training Data: 3.58x increase in training samples improves coverage of real-world scenarios
  • Consistent Convergence: Both v5 and v6 converged at step 4,500, showing reliable training dynamics

Comparative Training Curves

v5 Training Pattern:

  • Best WER at step 4,500: 45.74%
  • Performance degraded after step 4,500
  • Early stopping triggered after 3 evaluations without improvement

v6 Training Pattern:

  • Best WER at step 4,500: 44.02%
  • Performance plateaued after step 4,500 (no improvement for 3 consecutive evaluations)
  • Early stopping triggered at step 6,000 (patience=3)
  • More stable with larger dataset, achieving better performance at same convergence point

Intended Uses & Limitations

Intended Uses

Primary:

  • Transcribing Swahili speech in Tanzania Child Helpline call center environments
  • Real-time or batch processing of telephony audio (8kHz phone quality)
  • Production ASR system for helpline service documentation and analytics

Secondary:

  • General Swahili ASR for telephony/call center applications
  • Research baseline for domain adaptation studies (clean speech → telephony)
  • Transfer learning base for similar low-resource telephony ASR tasks

Key Improvements Over Previous Versions

  • Expanded Training Data: 3.58x more training samples than v5
  • Improved Accuracy: 1.7pp WER reduction on telephony audio
  • Better Confidence: 33% reduction in validation loss
  • Enhanced Coverage: More diverse call scenarios and speaker characteristics
  • Telephony Robustness: Optimized for phone bandwidth (8kHz) and call quality variations
  • Dialect Coverage: Trained on authentic Tanzanian Swahili dialects from real conversations
  • Production Ready: Validated on actual helpline audio (not just clean datasets)

Limitations

⚠️ Domain-Specific Vocabulary:

  • Optimized for child helpline and healthcare-related conversations
  • May underperform on technical, legal, or specialized domains outside training data scope

⚠️ Dialect Specificity:

  • Best performance on Tanzanian Swahili dialects represented in training data
  • May have reduced accuracy on coastal, northern, or other regional variants not well-represented

⚠️ Audio Quality Requirements:

  • Designed for telephony audio (8kHz-16kHz); may need retuning for high-fidelity audio
  • Performance degrades with severe background noise or very poor connections (despite training on augmented noisy data)

⚠️ Code-Switching:

  • Limited handling of Swahili-English code-switching common in urban Tanzania
  • May struggle with mixed-language utterances

⚠️ Model Size:

  • Large model (Whisper Large v2 architecture) requires GPU for real-time transcription
  • Consider quantization or distillation for edge deployment

Training and Evaluation Data

The model was trained on the Swahili ASR Dataset v7, a private dataset curated specifically for this task, with significantly expanded coverage compared to the v6 dataset used for the previous model.

Data Privacy & Access

Status: 🔒 Private / Internal Use Only

The dataset is not publicly available due to strict privacy and Personally Identifiable Information (PII) concerns. The source audio consists of real calls to the Tanzania Child Helpline. While the model weights are shared, the training data remains confidential to protect the identities of callers, many of whom are minors.

Dataset Volume (Hours & Samples)

The dataset utilizes a 4x augmentation strategy (for training set only) to maximize the utility of the available domain-specific audio.

| Split | Samples | Original Duration | Augmented Duration | Notes |
|-------|---------|-------------------|--------------------|-------|
| Training | 113,545 | ~39.0 hours | ~173.4 hours | 1 original + ~3 augmented versions per sample |
| Validation | 2,836 | ~1.4 hours | ~1.4 hours | Original audio only (no augmentation) |
| Test | 2,839 | ~1.4 hours | ~1.4 hours | Original audio only (no augmentation) |
| TOTAL | 119,220 | ~41.7 hours | ~176.2 hours | |

Data Characteristics

  • Source: Real-world phone call audio (not studio recordings)
  • Language: Tanzanian Swahili with natural conversational characteristics
  • Format: Telephony quality (primarily 8kHz, upsampled to 16kHz for Whisper)
  • Content: Domain-relevant vocabulary (child welfare, healthcare, family support)
  • Original Files: 28,355 unique audio recordings

Audio Augmentation Strategy

To make the model robust against the noisy environment of a call center, the training set was expanded from ~39.0 hours to ~173.4 hours using a multi-technique augmentation strategy.

Training samples were augmented with the following techniques; a sketch of a few of them appears after the note below:

  1. Volume Variation (±6 dB): Simulating distant or loud speakers
  2. VTLP (Vocal Tract Length Perturbation): Simulating different speaker characteristics
  3. Colored Noise: Simulating background static/environment (White/Pink/Brown noise)
  4. Time Stretch: Variation in speaking speed (0.9x - 1.1x)
  5. Pitch Shift: Variation in tone (±2 semitones)
  6. Packet Loss: Simulating VoIP connection drops
  7. Codec Masking: Simulating compression artifacts

Augmentation Files: 90,732 augmented audio files generated from 28,355 original recordings

Note: Validation and Test splits contain only original audio to ensure unbiased evaluation metrics.
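
The augmentation code itself is not published. The sketch below shows how a few of the listed transforms (volume variation, time stretch, pitch shift, additive noise) could be implemented with librosa and NumPy; the SNR range, the composition order, and applying all transforms to every copy are assumptions, and VTLP, packet loss, and codec masking are omitted for brevity.

```python
import numpy as np
import librosa

def augment_once(y: np.ndarray, sr: int, rng: np.random.Generator) -> np.ndarray:
    """Produce one augmented copy of a waveform (illustrative, not the actual pipeline)."""
    # Volume variation: random gain in the ±6 dB range
    gain_db = rng.uniform(-6.0, 6.0)
    y = y * (10.0 ** (gain_db / 20.0))

    # Time stretch: 0.9x - 1.1x speaking speed
    y = librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1))

    # Pitch shift: ±2 semitones
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2.0, 2.0))

    # Additive white noise at a random SNR (colored noise would shape this spectrum);
    # the 10-30 dB SNR range is an assumption, not stated in the card
    snr_db = rng.uniform(10.0, 30.0)
    noise = rng.normal(0.0, 1.0, size=y.shape)
    noise *= np.sqrt(np.mean(y**2)) / (np.sqrt(np.mean(noise**2)) * 10.0 ** (snr_db / 20.0))
    return y + noise
```

Generating roughly three such copies per original file reproduces the "1 original + ~3 augmented versions per sample" ratio described above.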

Training Procedure

Training Hyperparameters

Optimization:

  • learning_rate: 1e-05
  • lr_scheduler_type: cosine_with_restarts
  • lr_scheduler_warmup_steps: 500
  • optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08
  • max_training_steps: 20,000 (early stopped at 6,000, best at 4,500)
  • seed: 42

Batch Configuration:

  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • gradient_accumulation_steps: 1
  • Effective batch size: 16

Memory Optimization:

  • gradient_checkpointing: true
  • mixed_precision_training: Native AMP (FP16)

Evaluation & Checkpointing:

  • evaluation_strategy: steps
  • eval_steps: 500
  • save_steps: 500
  • logging_steps: 50
  • save_total_limit: 3

Best Model Selection:

  • load_best_model_at_end: true
  • metric_for_best_model: "wer"
  • greater_is_better: false
  • early_stopping_patience: 3 evaluations (1,500 steps)

Infrastructure:

  • GPU: RunPod A40 (48GB VRAM)
  • Training time: ~6 hours for 6,000 steps
  • Checkpoint size: ~3GB per checkpoint
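
The training script is not published, but the hyperparameters above map directly onto Hugging Face Seq2SeqTrainingArguments; a sketch of that configuration (the output path is a placeholder, and the compute_metrics wiring for WER is omitted):

```python
from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-largev2-v6",       # placeholder path
    learning_rate=1e-5,
    lr_scheduler_type="cosine_with_restarts",
    warmup_steps=500,
    max_steps=20_000,
    seed=42,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=1,           # effective batch size 16
    gradient_checkpointing=True,
    fp16=True,                               # native AMP mixed precision
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    logging_steps=50,
    save_total_limit=3,
    predict_with_generate=True,              # WER must be computed on generated text
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
)

# Early stopping: 3 evaluations (1,500 steps) without WER improvement
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
```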

Training Results

| Training Loss | Epoch | Step | Validation Loss | WER (%) | Notes |
|---------------|-------|------|-----------------|---------|-------|
| 1.051 | 0.025 | 500 | 0.8807 | 49.3341 | Initial adaptation |
| 0.9047 | 0.05 | 1000 | 0.8018 | 52.9311 | |
| 0.8012 | 0.075 | 1500 | 0.7702 | 48.1389 | |
| 0.7306 | 0.1 | 2000 | 0.7492 | 48.0023 | |
| 0.6764 | 0.125 | 2500 | 0.7437 | 45.3614 | |
| 0.653 | 0.15 | 3000 | 0.7454 | 47.1372 | |
| 0.6054 | 0.175 | 3500 | 0.7441 | 44.2231 | |
| 0.5515 | 0.2 | 4000 | 0.7535 | 44.2459 | |
| 0.5237 | 0.225 | 4500 | 0.7552 | 44.0182 | Best checkpoint |
| 0.4589 | 0.25 | 5000 | 0.7575 | 45.3728 | No improvement (1/3) |
| 0.4284 | 0.275 | 5500 | 0.7657 | 44.2573 | No improvement (2/3) |
| 0.3789 | 0.3 | 6000 | 0.7827 | 44.2686 | No improvement (3/3), early stopping triggered |

Training Observations:

  • Convergence: Best WER achieved at step 4,500 (44.02%)
  • Early Stopping: Triggered after 3 consecutive evaluations without improvement (patience=3)
  • Stable Learning: Largely consistent improvement through 4,500 steps with the larger dataset
  • Model Selection: Weights automatically restored to the step 4,500 checkpoint
  • Training Curve: Overall improvement from 49.33% → 44.02% WER over the first 4,500 steps, with a brief regression at step 1,000

Final Metrics (Step 4,500):

  • Training loss: 0.5237
  • Validation loss: 0.7552
  • Validation WER: 44.02%
  • Total training time: ~4.5 hours
  • Total training samples processed: ~72,000 (4,500 steps × effective batch size 16)

Domain Adaptation Summary

| Stage | Dataset | WER | Domain |
|-------|---------|-----|--------|
| Stage 1 (Base) | Common Voice 17.0 | 23.62% | Clean read speech |
| Stage 2 (Base) | Common Voice 23.0 | 23.56% | Clean read speech |
| Stage 3 v5 | Real Phone Calls v6 | 45.74% | Telephony, conversational |
| Stage 3 v6 (This Model) | Real Phone Calls v7 | 44.02% | Telephony, conversational (expanded) |

Domain Gap Analysis: The ~20.5 percentage point WER increase from Common Voice (23.56%) to real phone calls (44.02%) quantifies the domain adaptation challenge:

  • 📞 Telephony bandwidth vs. full-bandwidth audio
  • 🎤 Conversational vs. read speech
  • 🔊 Real noise conditions vs. clean recordings
  • 🗣️ Natural disfluencies vs. prepared text

This gap is expected and normal for production ASR systems deployed on telephony audio. The v6 improvement demonstrates effective learning from expanded training data, with both v5 and v6 converging at the same step count (4,500), indicating consistent training dynamics.

Performance Comparison

| Model | Test Domain | Training Data | WER | Notes |
|-------|-------------|---------------|-----|-------|
| Whisper Large v2 (zero-shot) | Common Voice 17.0 | - | 89.05% | Baseline |
| Base model (v1) - Stage 1 | Common Voice 17.0 | - | 23.62% | Clean speech tuning |
| Base model (v1) - Stage 2 | Common Voice 23.0 | - | 23.56% | Clean speech tuning |
| v5 model | Real phone calls v6 | ~46.5 hours | 45.74% | Telephony adaptation |
| This model (v6) | Real phone calls v7 | ~173.4 hours | 44.02% | Enhanced telephony adaptation |

Key Insights:

  • Data Scaling Works: 3.73x more training data yields 1.7pp WER improvement
  • Production Optimized: Model optimized for actual telephony domain, not clean speech
  • Better Confidence: 33% lower validation loss (0.76 vs 1.13) indicates improved model reliability
  • Consistent Training: Both v5 and v6 converged at step 4,500, showing reliable optimization
  • Further Evaluation Ongoing: Comprehensive test set evaluation in progress

Usage

Quick Start

```python
from transformers import pipeline

# Load the model
pipe = pipeline("automatic-speech-recognition",
                model="openchs/asr-whisper-largev2-v6")

# Transcribe phone call audio
result = pipe("path/to/phone_call.wav")
print(result["text"])
```

Advanced Usage with Audio Preprocessing

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

# Load model and processor
processor = WhisperProcessor.from_pretrained("openchs/asr-whisper-largev2-v6")
model = WhisperForConditionalGeneration.from_pretrained("openchs/asr-whisper-largev2-v6")

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Load and preprocess audio (handles telephony audio)
audio, sr = librosa.load("path/to/phone_call.wav", sr=16000, mono=True)

# Process audio
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(device)

# Generate transcription with language hint
forced_decoder_ids = processor.get_decoder_prompt_ids(language="sw", task="transcribe")
predicted_ids = model.generate(
    input_features,
    forced_decoder_ids=forced_decoder_ids,
    max_length=448
)

# Decode transcription
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```

Production Deployment Recommendations

Audio Requirements:

  • Sample rate: 16kHz (model will work with 8kHz telephony audio upsampled to 16kHz)
  • Format: Mono (single channel)
  • Duration: Optimal <30 seconds per segment for memory efficiency (longer files can be chunked; see the sketch below)
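
For recordings longer than ~30 seconds, the transformers pipeline can chunk the audio internally; a sketch reusing the pipe object from Quick Start (the file path is a placeholder):

```python
# Chunked transcription for long calls
result = pipe(
    "path/to/long_call.wav",
    chunk_length_s=30,        # split into ~30 s windows internally
    batch_size=8,             # decode several chunks in parallel
    return_timestamps=True,   # recommended for long-form audio
    generate_kwargs={"language": "sw", "task": "transcribe"},
)
print(result["text"])
```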

Inference Optimization:

```python
# Use half-precision for faster GPU inference (model from the advanced example above)
model = model.half()  # FP16

# Batch processing for multiple files; pipe is the Quick Start pipeline and
# audio_files is a list of file paths
batch_size = 8
results = pipe(audio_files, batch_size=batch_size)
```

Real-time Considerations:

  • GPU required for real-time transcription (RTF < 1.0)
  • CPU inference possible but slower (RTF ~3-5x on modern CPUs)
  • Consider model quantization for edge deployment (sketched below)
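
Quantized weights are not shipped with this checkpoint; one option for CPU deployment is PyTorch dynamic quantization, sketched below (accuracy impact should be re-validated on telephony audio). Here RTF = processing time / audio duration, so RTF < 1.0 means faster than real time.

```python
import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openchs/asr-whisper-largev2-v6")

# Dynamically quantize the linear layers to INT8 for faster CPU inference
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```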

Evaluation Methodology

Validation Set:

  • 2,836 samples from real phone call recordings
  • Evaluated every 500 training steps
  • Represents diverse call scenarios and speakers
  • Original audio only (no augmentation)

WER Calculation:

  • Standard Word Error Rate: (Substitutions + Deletions + Insertions) / Total Words
  • Normalized text (lowercase, punctuation handling)
  • Swahili-specific text normalization applied
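
The exact Swahili normalization rules are not published; the sketch below shows only the WER computation itself using the evaluate library, with simple lowercasing as a stand-in for the full normalization (the example strings are toy data):

```python
import evaluate

wer_metric = evaluate.load("wer")

references = ["habari za asubuhi"]    # ground-truth transcript (toy example)
predictions = ["habari ya asubuhi"]   # model output (toy example)

# WER = (substitutions + deletions + insertions) / total reference words
wer = wer_metric.compute(
    predictions=[p.lower() for p in predictions],
    references=[r.lower() for r in references],
)
print(f"WER: {wer:.2%}")  # 1 substitution / 3 words ≈ 33.33%
```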

Best Model Selection:

  • Automatic selection based on lowest validation WER
  • Early stopping with patience of 3 evaluations
  • Final model: Step 4,500 checkpoint (restored after early stopping at step 6,000)

Future Work

  • Test set evaluation: Comprehensive evaluation on held-out 2,839-sample test set (in progress)
  • Benchmark against v5: Direct comparison on identical test data
  • Code-switching support: Improve Swahili-English mixed utterance handling
  • Model compression: Quantization and distillation for faster inference
  • Streaming ASR: Adapt for real-time streaming transcription
  • Dialect expansion: Include more regional Swahili variants
  • Noise robustness: Further augmentation with extreme noise conditions
  • Ablation studies: Analyze impact of different augmentation techniques

Citation

If you use this model in your research or production systems, please cite:

```bibtex
@misc{openchs-swahili-asr-v6,
  title={Enhanced Domain-Adapted Swahili ASR for Tanzania Child Helpline Telephony},
  author={OpenCHS Team},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/openchs/asr-whisper-largev2-v6}},
  note={Fine-tuned from openchs/asr-whisper-helpline-sw-v1 on expanded phone call data (v7)}
}
```

Framework Versions

  • Transformers: 4.56.2
  • PyTorch: 2.8.0+cu128
  • Datasets: 2.21.0
  • Tokenizers: 0.22.1

License

Apache 2.0

Acknowledgments


Model Status: ✅ Production Ready - Enhanced version optimized for Tanzania Child Helpline telephony transcription

Last Updated: 2025-12-17 (Checkpoint 4,500 restored after early stopping at step 6,000)
