🎙️ High-Fidelity Voice AI Dataset Pipeline

A production-oriented speech data factory for building expressive text-to-speech (TTS), dubbing, and voice AI training datasets. It automatically ingests, normalizes, segments, diarizes, tags, validates, and exports raw audio and video into clean, training-ready speech corpora.

🏗️ Pipeline Architecture

The pipeline executes a series of decoupled, modular stages. Each stage supports pluggable backends and state checkpoints for fault tolerance, resumability, and idempotence.

[Figure: Voice AI Dataset Pipeline Architecture]
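
To make the checkpointing idea concrete, here is a minimal sketch of how a stage with a pluggable backend, a file-based checkpoint, and per-file failure isolation could fit together. It is not the package's actual code: the `detect` method, the `FixedWindowVad` toy backend, the `run_stage` helper, and the JSON checkpoint layout are all assumptions made for illustration. Only the `VadBackend` name comes from this README.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path


class VadBackend(ABC):
    """Hypothetical interface; the real method names may differ."""

    @abstractmethod
    def detect(self, audio_path: str) -> list[tuple[float, float]]:
        """Return (start_sec, end_sec) speech regions for one file."""


class FixedWindowVad(VadBackend):
    """Toy backend that exists only to make the sketch runnable."""

    def detect(self, audio_path: str) -> list[tuple[float, float]]:
        if audio_path.endswith(".bad"):
            raise ValueError(f"cannot decode {audio_path}")
        return [(0.0, 1.0)]


def run_stage(files: list[str], backend: VadBackend, checkpoint: Path) -> dict:
    """Process each file at most once, recording its status in a JSON checkpoint."""
    state = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    for path in files:
        if state.get(path, {}).get("status") == "done":
            continue  # completed in an earlier run, so skip it
        try:
            regions = backend.detect(path)
            state[path] = {"status": "done", "regions": regions}
        except Exception as exc:  # isolate the failure to this one file
            state[path] = {"status": "failed", "error": str(exc)}
        # Persist after every file so a crash loses at most one file of work.
        checkpoint.write_text(json.dumps(state, indent=2))
    return state
```

In this sketch, files marked `failed` are retried on the next run, while files marked `done` are skipped, which is what makes re-running the same command safe.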

🚀 Key Features

Idempotency & Resumability: Every stage uses file-based checkpoints. If a run crashes or is stopped, re-running the command skips already-completed files.
Model-Agnostic Interfaces: Abstract base classes (VadBackend, DiarizationBackend, EmotionBackend, AsrBackend) define the VAD, diarization, emotion-tagging, and ASR backends, so a model can be swapped without changing the surrounding stage code.
Fault-Tolerant Processing: Individual file processing failures are isolated. If one audio file fails at a stage, the pipeline logs the failure, updates checkpoints, and moves to the next file without crashing.
Production Logging: Logs to two sinks: an interactive Rich console and a file of structured JSON log lines suitable for Kibana/Grafana monitoring.
QA Guardrails: Filters training data based on signal metrics (clipping ratio, signal-to-noise ratio (SNR), silence-to-speech ratio) and ML model confidence scores.
Diverse Export Formats: Exports datasets as segments.jsonl (standard manifest), annotations.csv, diarization.rttm, and HuggingFace-compatible dataset manifests, along with copies of the WAV files.
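
The signal metrics named in the QA guardrails can be sketched in plain Python as follows. This is an illustrative approximation, not the validator shipped in this package: the function names, the frame length, the energy-based silence threshold, and every pass/fail threshold are assumptions. The SNR estimate is deliberately crude, comparing mean power in frames classified as speech against mean power in frames classified as silence.

```python
import math


def clipping_ratio(samples, threshold=0.999):
    """Fraction of samples at or beyond the clipping threshold (full scale = 1.0)."""
    if not samples:
        return 0.0
    return sum(abs(s) >= threshold for s in samples) / len(samples)


def frame_rms(samples, frame_len=160):
    """RMS of consecutive non-overlapping frames (a trailing partial frame is dropped)."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def silence_to_speech_ratio(rms, silence_rms=0.01):
    """Number of silent frames divided by number of speech frames."""
    silent = sum(r < silence_rms for r in rms)
    speech = len(rms) - silent
    return silent / speech if speech else math.inf


def estimate_snr_db(rms, silence_rms=0.01):
    """Crude SNR in dB: mean speech-frame power over mean silent-frame power."""
    speech = [r * r for r in rms if r >= silence_rms]
    noise = [r * r for r in rms if r < silence_rms]
    if not speech or not noise:
        return None  # cannot estimate without both kinds of frame
    noise_power = sum(noise) / len(noise)
    if noise_power == 0:
        return math.inf
    return 10 * math.log10((sum(speech) / len(speech)) / noise_power)


def passes_qa(samples, max_clipping=0.01, min_snr_db=15.0, max_silence_to_speech=1.0):
    """Apply hypothetical thresholds; a segment with no SNR estimate is not rejected on SNR."""
    rms = frame_rms(samples)
    snr = estimate_snr_db(rms)
    return (
        clipping_ratio(samples) <= max_clipping
        and silence_to_speech_ratio(rms) <= max_silence_to_speech
        and (snr is None or snr >= min_snr_db)
    )
```

A real validator would more likely operate on NumPy arrays and use a VAD-derived speech mask instead of a fixed energy threshold; the list-based version here only keeps the example dependency-free.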

📁 Repository Structure

```
voice_ai_pipeline/
├── pyproject.toml                     # Package configurations & CLI declaration
├── README.md                          # Repository Card / Documentation
├── voice_pipeline/                    # Core Package
│   ├── __init__.py
│   ├── pipeline.py                    # Master pipeline orchestrator
│   ├── ingestion/                     # Ingestion & YouTube download
│   ├── audio_processing/              # Normalization & DSP preprocessing
│   ├── speech_activity/               # Voice Activity Detection (VAD)
│   ├── diarization/                   # Speaker turn clustering & mapping
│   ├── emotion_tagging/               # Speech style & emotion classification
│   ├── asr/                           # Whisper transcription & alignment
│   ├── validation/                    # DSP & quality assurance validator
│   ├── export/                        # Manifest & dataset generation
│   ├── reports/                       # Summary metrics & analytics plots
│   ├── utils/                         # Logging, checkpoints, & helper scripts
│   ├── configs/                       # Configuration YAML & sources catalog
│   └── cli/                           # Typer CLI application
├── tests/                             # Test Suite
│   └── unit/
└── data/                              # Local storage (created at runtime)
    ├── raw/                           # Downloaded raw audio files
    ├── processed/                     # Normalized WAV files
    ├── segments/                      # Split speech segment WAV files
    ├── exports/                       # Final exported manifests & WAVs
    └── reports/                       # HTML/JSON reports & plots
```

🛠️ Installation & Setup

Clone the repository and navigate to the directory:
```
git clone https://huggingface.co/bhriguverma/speech-data-factory
cd speech-data-factory
```

Install the package in editable mode (this installs all dependencies and registers the voice-pipeline command):
```
pip install -e .
```
Set up HuggingFace authorization (required for gated models like Pyannote Diarization 3.1; a token with read access is sufficient):
```
export HF_TOKEN="your_huggingface_access_token"
```

💻 CLI Usage Examples

Installing the package registers the voice-pipeline command-line tool, which you can run directly:

1. Process a Local File or Directory

Run the pipeline on a single audio file or directory of media files (MP3, WAV, MP4, MKV, etc.):

```
# Run on a single file
voice-pipeline run-local data/my_audio.mp3 --lang hi --type audiobook

# Run on a directory of files
voice-pipeline run-local /workspace/media_folder/ --lang hi --type podcast
```

2. Process a YouTube Video URL

Download and process a single YouTube video:

```
voice-pipeline run-youtube "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --lang hi --type storytelling
```

3. Process an Entire YouTube Playlist

Batch download and process a storytelling playlist (capped at 10 videos):

```
voice-pipeline run-playlist "https://www.youtube.com/playlist?list=PL..." --max-videos 10 --lang hi --type audiobook
```

4. Check Checkpoint Statistics

View completion rates and failures per stage:

```
voice-pipeline show-stats
```

5. Reset Checkpoints

Force a clean re-run of a stage or the whole pipeline:

```
# Reset VAD stage only
voice-pipeline reset-checkpoints --stage vad

# Reset all stages
voice-pipeline reset-checkpoints
```

⚙️ Configuration

Runtime overrides can be configured directly in voice_pipeline/configs/pipeline_config.yaml. You can also override parameters using environment variables prefixed with VOICE_:

```
# Force target sample rate to 22.05 kHz and run on CPU
export VOICE_AUDIO_TARGET_SAMPLE_RATE=22050
export VOICE_PIPELINE_DEVICE=cpu
voice-pipeline run-local data/audio.mp3
```
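
As a rough illustration of how VOICE_-prefixed variables could map onto a nested configuration, the sketch below assumes a naming rule of `VOICE_<SECTION>_<KEY>`, where the first token after the prefix is the section and the remainder is the key. That rule, the `DEFAULTS` values, and the `apply_env_overrides` helper are assumptions made for this example and may not match how the package resolves its settings.

```python
import copy
import os

# Hypothetical defaults; the real values live in pipeline_config.yaml.
DEFAULTS = {
    "audio": {"target_sample_rate": 16000},
    "pipeline": {"device": "cuda"},
}


def apply_env_overrides(config, environ=None, prefix="VOICE_"):
    """Return a copy of config with matching VOICE_<SECTION>_<KEY> variables applied."""
    environ = os.environ if environ is None else environ
    merged = copy.deepcopy(config)  # never mutate the caller's config
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue
        # VOICE_AUDIO_TARGET_SAMPLE_RATE -> section "audio", key "target_sample_rate"
        section, _, key = name[len(prefix):].lower().partition("_")
        if section not in merged or key not in merged[section]:
            continue  # ignore variables that do not match a known setting
        default = merged[section][key]
        if isinstance(default, bool):
            merged[section][key] = raw.strip().lower() in {"1", "true", "yes"}
        else:
            merged[section][key] = type(default)(raw)  # cast to the default's type
    return merged
```

Casting to the type of the existing default keeps `22050` an integer instead of a string, and unknown variables are ignored instead of silently creating new settings.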

📊 Quality Reporting & Outputs

At the end of a run, the pipeline exports:

JSONL Manifest: data/exports/segments.jsonl contains one detailed record per valid segment, including start/end times, speaker ID, emotion, style, transcript, and SNR.
Dataset CSV: data/exports/annotations.csv provides a flat, spreadsheet-style view of the dataset.
Analytics Dashboard: data/reports/quality_report.json details pass rates and the emotion and speaker distributions.
Plots: Distribution plots (emotion_distribution.png, speaker_distribution.png, rejection_distribution.png) are generated under data/reports/.
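
For a quick sanity check of an exported manifest, a summary along these lines can be computed directly from the JSONL records. The field names `speaker_id`, `emotion`, and `snr_db` are guesses at the schema and should be adjusted to whatever segments.jsonl actually contains; `summarize_manifest` is a hypothetical helper, not part of the package.

```python
import json
from collections import Counter


def summarize_manifest(lines):
    """Count segments per emotion and speaker and average the SNR over JSONL lines."""
    emotions, speakers, snrs = Counter(), Counter(), []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Field names below are assumed, not taken from the real schema.
        emotions[record.get("emotion", "unknown")] += 1
        speakers[record.get("speaker_id", "unknown")] += 1
        if record.get("snr_db") is not None:
            snrs.append(record["snr_db"])
    return {
        "segments": sum(emotions.values()),
        "emotions": dict(emotions),
        "speakers": dict(speakers),
        "mean_snr_db": sum(snrs) / len(snrs) if snrs else None,
    }
```

Because the function accepts any iterable of lines, it can be fed an open file handle, for example `with open("data/exports/segments.jsonl") as f: summarize_manifest(f)`.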
