🎙️ High-Fidelity Voice AI Dataset Pipeline

A production-grade speech data factory for building expressive text-to-speech (TTS), dubbing, and voice AI training datasets. It ingests raw audio and video, then normalizes, segments, diarizes, tags, validates, and exports the speech as clean, training-ready corpora.

🏗️ Pipeline Architecture

The pipeline executes a series of decoupled, modular stages. Each stage supports pluggable backends and state checkpoints for fault tolerance, resumability, and idempotence.

[Figure: Voice AI Dataset Pipeline Architecture]
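
A minimal sketch of what a pluggable stage backend could look like. Only the class name VadBackend comes from this README; the detect method and its signature, the EnergyVad toy backend, and the BACKENDS registry are hypothetical and are not taken from the package's actual code.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Sequence, Tuple, Type

Segment = Tuple[float, float]  # (start_sec, end_sec)


class VadBackend(ABC):
    """Hypothetical interface; the real method names may differ."""

    @abstractmethod
    def detect(self, samples: Sequence[float], sample_rate: int) -> List[Segment]:
        """Return speech regions as (start, end) pairs in seconds."""


class EnergyVad(VadBackend):
    """Toy backend: a frame counts as speech if its mean energy exceeds a threshold."""

    def __init__(self, threshold: float = 0.01, frame_ms: int = 20) -> None:
        self.threshold = threshold
        self.frame_ms = frame_ms

    def detect(self, samples: Sequence[float], sample_rate: int) -> List[Segment]:
        frame_len = max(1, sample_rate * self.frame_ms // 1000)
        segments: List[Segment] = []
        start: Optional[float] = None
        for i in range(0, len(samples), frame_len):
            frame = samples[i:i + frame_len]
            energy = sum(x * x for x in frame) / len(frame)
            if energy > self.threshold and start is None:
                start = i / sample_rate          # speech begins at this frame
            elif energy <= self.threshold and start is not None:
                segments.append((start, i / sample_rate))
                start = None                     # speech ended before this frame
        if start is not None:                    # audio ended while still in speech
            segments.append((start, len(samples) / sample_rate))
        return segments


# Hypothetical registry: a config string selects the implementation.
BACKENDS: Dict[str, Type[VadBackend]] = {"energy": EnergyVad}


def make_vad(name: str, **kwargs) -> VadBackend:
    return BACKENDS[name](**kwargs)
```

Under this design, the orchestrator only ever calls the abstract method, so registering another subclass is enough to swap models.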


🚀 Key Features

  • Idempotency & Resumability: Every stage uses file-based checkpoints. If a run crashes or is stopped, re-running the command skips already-completed files.
  • Model-Agnostic Interfaces: Abstract base classes define the VAD (VadBackend), Diarization (DiarizationBackend), Emotion Tagging (EmotionBackend), and ASR (AsrBackend) backends, so a model can be swapped without changing the rest of the pipeline.
  • Fault-Tolerant Processing: Individual file processing failures are isolated. If one audio file fails at a stage, the pipeline logs the failure, updates checkpoints, and moves to the next file without crashing.
  • Production Logging: Logs to both the console (interactive Rich output) and a file (structured JSON log lines for Kibana/Grafana monitoring).
  • Quality QA Guardrails: Filters training data based on signal metrics (clipping ratio, Signal-to-Noise Ratio (SNR), silence-to-speech ratio) and ML model confidence scores.
  • Diverse Export Formats: Exports datasets as segments.jsonl (standard manifest), annotations.csv, diarization.rttm, and HuggingFace-compatible dataset manifests, along with copies of the WAV files.
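
The signal metrics named in the QA guardrails can be illustrated with a short, dependency-free sketch. Every function name, the frame length, and every threshold below is an assumption made for illustration; the pipeline's actual validator and its cut-offs are not documented here. The SNR estimate is a crude heuristic (quietest frames taken as the noise floor), not necessarily the method the package uses.

```python
import math
from typing import List, Sequence


def clipping_ratio(samples: Sequence[float], limit: float = 0.999) -> float:
    """Fraction of samples at or beyond full scale (samples assumed in [-1, 1])."""
    if not samples:
        return 0.0
    return sum(1 for x in samples if abs(x) >= limit) / len(samples)


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each consecutive frame."""
    frames = (samples[i:i + frame_len] for i in range(0, len(samples), frame_len))
    return [sum(x * x for x in f) / len(f) for f in frames]


def estimate_snr_db(samples: Sequence[float], frame_len: int = 400,
                    noise_fraction: float = 0.1) -> float:
    """Crude SNR: the quietest frames are treated as the noise floor."""
    energies = sorted(frame_energies(samples, frame_len))
    if not energies:
        return float("-inf")
    n_noise = max(1, int(len(energies) * noise_fraction))
    noise = sum(energies[:n_noise]) / n_noise
    rest = energies[n_noise:] or energies
    signal = sum(rest) / len(rest)
    if noise == 0.0:
        return math.inf if signal > 0.0 else float("-inf")
    return 10.0 * math.log10(signal / noise)


def silence_to_speech_ratio(samples: Sequence[float], frame_len: int = 400,
                            silence_energy: float = 1e-4) -> float:
    """Number of silent frames divided by number of non-silent frames."""
    energies = frame_energies(samples, frame_len)
    silent = sum(1 for e in energies if e < silence_energy)
    speech = len(energies) - silent
    return silent / speech if speech else math.inf


def passes_qa(samples: Sequence[float], max_clipping: float = 0.01,
              min_snr_db: float = 15.0, max_silence_ratio: float = 1.0) -> bool:
    """Accept a segment only if all three (hypothetical) thresholds hold."""
    return (clipping_ratio(samples) <= max_clipping
            and estimate_snr_db(samples) >= min_snr_db
            and silence_to_speech_ratio(samples) <= max_silence_ratio)
```

A segment that fails any one check would be rejected; keeping the checks as separate functions makes it easy to log which metric caused the rejection.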

📁 Repository Structure

voice_ai_pipeline/
├── pyproject.toml                     # Package configurations & CLI declaration
├── README.md                          # Repository Card / Documentation
├── voice_pipeline/                    # Core Package
│   ├── __init__.py
│   ├── pipeline.py                    # Master pipeline orchestrator
│   ├── ingestion/                     # Ingestion & YouTube download
│   ├── audio_processing/              # Normalization & DSP preprocessing
│   ├── speech_activity/               # Voice Activity Detection (VAD)
│   ├── diarization/                   # Speaker turn clustering & mapping
│   ├── emotion_tagging/               # Speech style & emotion classification
│   ├── asr/                           # Whisper transcription & alignment
│   ├── validation/                    # DSP & quality assurance validator
│   ├── export/                        # Manifest & dataset generation
│   ├── reports/                       # Summary metrics & analytics plots
│   ├── utils/                         # Logging, checkpoints, & helper scripts
│   ├── configs/                       # Configuration YAML & sources catalog
│   └── cli/                           # Typer CLI application
├── tests/                             # Test Suite
│   └── unit/
└── data/                              # Local storage (created at runtime)
    ├── raw/                           # Downloaded raw audio files
    ├── processed/                     # Normalized WAV files
    ├── segments/                      # Split speech segment WAV files
    ├── exports/                       # Final exported manifests & WAVs
    └── reports/                       # HTML/JSON reports & plots

🛠️ Installation & Setup

  1. Clone the Repository and navigate to the directory:

    git clone https://huggingface.co/bhriguverma/speech-data-factory
    cd speech-data-factory
    
  2. Install the package in editable mode (this installs all dependencies and registers the voice-pipeline command):

    pip install -e .
    
  3. Set up HuggingFace authorization (required for gated models like Pyannote Diarization 3.1):

    export HF_TOKEN="your_huggingface_write_token"
    

💻 CLI Usage Examples

Installing the package registers the voice-pipeline command-line tool, which exposes the following commands:

1. Process a Local File or Directory

Run the pipeline on a single audio file or directory of media files (MP3, WAV, MP4, MKV, etc.):

# Run on single file
voice-pipeline run-local data/my_audio.mp3 --lang hi --type audiobook

# Run on directory of files
voice-pipeline run-local /workspace/media_folder/ --lang hi --type podcast

2. Process a YouTube Video URL

Download and process a single YouTube video:

voice-pipeline run-youtube "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --lang hi --type storytelling

3. Process an Entire YouTube Playlist

Batch download and process a storytelling playlist (capping at 10 videos):

voice-pipeline run-playlist "https://www.youtube.com/playlist?list=PL..." --max-videos 10 --lang hi --type audiobook

4. Check Checkpoint Statistics

View completion rates and failures per stage:

voice-pipeline show-stats

5. Reset Checkpoints

Force a clean re-run of a stage or the whole pipeline:

# Reset VAD stage only
voice-pipeline reset-checkpoints --stage vad

# Reset all stages
voice-pipeline reset-checkpoints
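
The checkpoint behaviour behind show-stats and reset-checkpoints, together with the skip-on-rerun and per-file fault isolation described under Key Features, can be pictured with the sketch below. It assumes a single JSON checkpoint file keyed by stage and file name; the CheckpointStore class, the run_stage helper, and the on-disk layout are hypothetical and may not match the package's real checkpoint format.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


class CheckpointStore:
    """Hypothetical JSON checkpoint: {stage: {file_name: "done" | "failed"}}."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)
        self.state: Dict[str, Dict[str, str]] = (
            json.loads(self.path.read_text()) if self.path.exists() else {}
        )

    def _save(self) -> None:
        # Write to a temporary file first, then swap it in with one rename.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.state, indent=2))
        tmp.replace(self.path)

    def is_done(self, stage: str, item: str) -> bool:
        return self.state.get(stage, {}).get(item) == "done"

    def mark(self, stage: str, item: str, status: str) -> None:
        self.state.setdefault(stage, {})[item] = status
        self._save()

    def reset(self, stage: Optional[str] = None) -> None:
        """Forget one stage, or every stage when no name is given."""
        if stage is None:
            self.state.clear()
        else:
            self.state.pop(stage, None)
        self._save()

    def stats(self) -> Dict[str, Dict[str, int]]:
        out = {}
        for stage, items in self.state.items():
            values = list(items.values())
            out[stage] = {"done": values.count("done"),
                          "failed": values.count("failed")}
        return out


def run_stage(store: CheckpointStore, stage: str, items: Iterable[str],
              fn: Callable[[str], None]) -> None:
    """Process each item once; a failure is recorded and the loop continues."""
    for item in items:
        if store.is_done(stage, item):
            continue                      # completed in an earlier run
        try:
            fn(item)
        except Exception:
            store.mark(stage, item, "failed")
            continue
        store.mark(stage, item, "done")
```

In this sketch, failed items are retried on the next run because only items marked done are skipped; whether the real pipeline retries failures automatically is not stated in this README.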

⚙️ Configuration

Pipeline settings can be edited directly in voice_pipeline/configs/pipeline_config.yaml. Individual parameters can also be overridden at runtime with environment variables prefixed with VOICE_:

# Force target sample rate to 22.05kHz and run on CPU
export VOICE_AUDIO_TARGET_SAMPLE_RATE=22050
export VOICE_PIPELINE_DEVICE=cpu
voice-pipeline run-local data/audio.mp3
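
One way such overrides could be resolved is sketched below. It assumes the variable name maps to the configuration as VOICE_<SECTION>_<KEY>, which is inferred from the two examples above and not confirmed by the package; the apply_env_overrides function, the section and key names, and the default values are all hypothetical.

```python
import os
from typing import Any, Dict, Mapping, Optional


def _cast(raw: str, default: Any) -> Any:
    """Convert the raw string to the type of the existing default value."""
    if isinstance(default, bool):          # check bool before int (bool is an int)
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    if isinstance(default, int):
        return int(raw)
    if isinstance(default, float):
        return float(raw)
    return raw


def apply_env_overrides(config: Mapping[str, Mapping[str, Any]],
                        environ: Optional[Mapping[str, str]] = None,
                        prefix: str = "VOICE_") -> Dict[str, Dict[str, Any]]:
    """Return a copy of config with VOICE_<SECTION>_<KEY> values applied."""
    environ = os.environ if environ is None else environ
    merged = {section: dict(values) for section, values in config.items()}
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):].lower()
        for section, values in merged.items():
            if rest.startswith(section + "_"):
                key = rest[len(section) + 1:]
                if key in values:          # unknown keys are ignored
                    values[key] = _cast(raw, values[key])
                break
    return merged


# Assumed defaults, for illustration only.
defaults = {"audio": {"target_sample_rate": 16000}, "pipeline": {"device": "cuda"}}
env = {"VOICE_AUDIO_TARGET_SAMPLE_RATE": "22050", "VOICE_PIPELINE_DEVICE": "cpu"}
resolved = apply_env_overrides(defaults, env)
```

Casting against the default's type keeps a sample rate an integer even though environment variables are always strings.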

📊 Quality Reporting & Outputs

At the end of a run, the pipeline exports:

  1. JSON Manifest: data/exports/segments.jsonl contains detailed records for every valid segment, including start/end times, speaker ID, emotion, style, transcripts, and SNR.
  2. Dataset CSV: data/exports/annotations.csv provides a flat tabular spreadsheet view of the dataset.
  3. Analytics Dashboard: data/reports/quality_report.json details pass rates and the emotion and speaker distributions.
  4. Plots: Distribution plots (emotion_distribution.png, speaker_distribution.png, rejection_distribution.png) are generated under data/reports/.
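
As a hedged usage example, the exported manifest could be loaded and summarized as follows. The README lists the kinds of fields each record carries but not their exact key names, so start, end, speaker_id, emotion, and snr_db are assumed names, and the 20 dB cut-off is arbitrary.

```python
import json
from collections import Counter
from typing import Any, Dict, Iterable, Iterator


def read_manifest(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(records: Iterable[Dict[str, Any]],
              min_snr_db: float = 20.0) -> Dict[str, Any]:
    """Keep segments at or above the SNR cut-off and report basic totals."""
    # Key names below are assumptions; adjust them to the real manifest schema.
    kept = [r for r in records if r.get("snr_db", float("-inf")) >= min_snr_db]
    return {
        "segments": len(kept),
        "hours": sum(r["end"] - r["start"] for r in kept) / 3600.0,
        "by_speaker": dict(Counter(r["speaker_id"] for r in kept)),
        "by_emotion": dict(Counter(r["emotion"] for r in kept)),
    }
```

For example, summarize(read_manifest("data/exports/segments.jsonl")) would return the segment count, total hours, and per-speaker and per-emotion counts for segments that clear the cut-off.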