High-Fidelity Voice AI Dataset Pipeline
A production-grade speech data factory for building expressive Text-to-Speech (TTS), dubbing, and voice AI training datasets. It automatically ingests, normalizes, segments, diarizes, tags, validates, and exports raw audio and video data into clean, training-ready speech corpora.
Pipeline Architecture
The pipeline executes a series of decoupled, modular stages. Each stage supports pluggable backends and state checkpoints for fault tolerance, resumability, and idempotence.
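As a rough illustration of the pluggable-backend idea, the sketch below defines an abstract VAD interface and one toy implementation. The class name VadBackend appears in this README, but the `detect` method, its signature, and the `EnergyVad` and `run_vad` names are assumptions made for illustration, not the package's actual API.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence, Tuple


class VadBackend(ABC):
    """Hypothetical interface; the real method names may differ."""

    @abstractmethod
    def detect(self, samples: Sequence[float], sample_rate: int) -> List[Tuple[float, float]]:
        """Return speech regions as (start_sec, end_sec) pairs."""


class EnergyVad(VadBackend):
    """Toy backend: a frame counts as speech if its mean absolute amplitude exceeds a threshold."""

    def __init__(self, threshold: float = 0.1, frame_ms: int = 20) -> None:
        self.threshold = threshold
        self.frame_ms = frame_ms

    def detect(self, samples: Sequence[float], sample_rate: int) -> List[Tuple[float, float]]:
        frame_len = max(1, sample_rate * self.frame_ms // 1000)
        regions: List[Tuple[float, float]] = []
        start = None  # start time of the speech region currently open, if any
        for i in range(0, len(samples), frame_len):
            frame = samples[i:i + frame_len]
            energy = sum(abs(x) for x in frame) / len(frame)
            if energy > self.threshold and start is None:
                start = i / sample_rate
            elif energy <= self.threshold and start is not None:
                regions.append((start, i / sample_rate))
                start = None
        if start is not None:
            # Audio ended while still inside a speech region.
            regions.append((start, len(samples) / sample_rate))
        return regions


def run_vad(backend: VadBackend, samples: Sequence[float], sample_rate: int) -> List[Tuple[float, float]]:
    # The stage depends only on the interface, so any backend can be passed in.
    return backend.detect(samples, sample_rate)
```

Because `run_vad` only knows about the abstract class, replacing `EnergyVad` with a model-based backend would not require changes to the calling stage.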
Key Features
- Idempotency & Resumability: Every stage uses file-based checkpoints. If a run crashes or is stopped, re-running the command skips already-completed files.
- Model-Agnostic Interfaces: Abstract base classes define VAD (VadBackend), Diarization (DiarizationBackend), Emotion Tagging (EmotionBackend), and ASR (AsrBackend) backends. Swap between models easily.
- Fault-Tolerant Processing: Individual file processing failures are isolated. If one audio file fails at a stage, the pipeline logs the failure, updates checkpoints, and moves to the next file without crashing.
- Production Logging: Logs to two sinks: the console (interactive output via Rich) and a log file (structured JSON lines suitable for Kibana/Grafana monitoring).
- Quality QA Guardrails: Filters training data based on signal metrics (clipping ratio, Signal-to-Noise Ratio (SNR), silence-to-speech ratio) and ML model confidence scores.
- Diverse Export Formats: Exports datasets in segments.jsonl (standard manifest), annotations.csv, diarization.rttm, and HuggingFace-compatible dataset manifests, along with copied WAV files.
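The signal-level guardrails can be pictured with a small, dependency-free sketch. The function names, the thresholds, and the SNR estimator (energy of the loudest frames against the quietest frames) are all assumptions for illustration; this README does not document the pipeline's actual metrics or cut-offs.

```python
import math
from typing import List, Sequence


def clipping_ratio(samples: Sequence[float], limit: float = 0.999) -> float:
    """Fraction of samples at or beyond full scale (samples assumed to lie in [-1, 1])."""
    if not samples:
        return 0.0
    return sum(1 for x in samples if abs(x) >= limit) / len(samples)


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each consecutive frame."""
    energies = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energies.append(sum(x * x for x in frame) / len(frame))
    return energies


def estimate_snr_db(samples: Sequence[float], frame_len: int = 160, floor: float = 1e-10) -> float:
    """Crude SNR estimate: loudest 10% of frames versus quietest 10% of frames."""
    energies = sorted(frame_energies(samples, frame_len))
    k = max(1, len(energies) // 10)
    noise = max(sum(energies[:k]) / k, floor)    # floor avoids division by zero
    signal = max(sum(energies[-k:]) / k, floor)
    return 10.0 * math.log10(signal / noise)


def passes_guardrails(samples: Sequence[float],
                      max_clipping: float = 0.01,      # hypothetical threshold
                      min_snr_db: float = 15.0) -> bool:  # hypothetical threshold
    return clipping_ratio(samples) <= max_clipping and estimate_snr_db(samples) >= min_snr_db
```

A segment that fails either check would be rejected before export; model confidence scores could be added as further conditions in the same function.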
Repository Structure
voice_ai_pipeline/
├── pyproject.toml           # Package configurations & CLI declaration
├── README.md                # Repository Card / Documentation
├── voice_pipeline/          # Core Package
│   ├── __init__.py
│   ├── pipeline.py          # Master pipeline orchestrator
│   ├── ingestion/           # Ingestion & YouTube download
│   ├── audio_processing/    # Normalization & DSP preprocessing
│   ├── speech_activity/     # Voice Activity Detection (VAD)
│   ├── diarization/         # Speaker turn clustering & mapping
│   ├── emotion_tagging/     # Speech style & emotion classification
│   ├── asr/                 # Whisper transcription & alignment
│   ├── validation/          # DSP & quality assurance validator
│   ├── export/              # Manifest & dataset generation
│   ├── reports/             # Summary metrics & analytics plots
│   ├── utils/               # Logging, checkpoints, & helper scripts
│   ├── configs/             # Configuration YAML & sources catalog
│   └── cli/                 # Typer CLI application
├── tests/                   # Test Suite
│   └── unit/
└── data/                    # Local storage (created at runtime)
    ├── raw/                 # Downloaded raw audio files
    ├── processed/           # Normalized WAV files
    ├── segments/            # Split speech segment WAV files
    ├── exports/             # Final exported manifests & WAVs
    └── reports/             # HTML/JSON reports & plots
Installation & Setup
Clone the Repository and navigate to the directory:
git clone https://huggingface.co/bhriguverma/speech-data-factory
cd speech-data-factory
Install the package in editable mode (this installs all dependencies and registers the voice-pipeline command):
pip install -e .
Set up HuggingFace authorization (required for gated models like Pyannote Diarization 3.1):
export HF_TOKEN="your_huggingface_write_token"
CLI Usage Examples
The package installs a command-line tool, voice-pipeline, which you can run directly:
1. Process a Local File or Directory
Run the pipeline on a single audio file or directory of media files (MP3, WAV, MP4, MKV, etc.):
# Run on single file
voice-pipeline run-local data/my_audio.mp3 --lang hi --type audiobook
# Run on directory of files
voice-pipeline run-local /workspace/media_folder/ --lang hi --type podcast
2. Process a YouTube Video URL
Download and process a single YouTube video:
voice-pipeline run-youtube "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --lang hi --type storytelling
3. Process an Entire YouTube Playlist
Batch download and process a storytelling playlist (capping at 10 videos):
voice-pipeline run-playlist "https://www.youtube.com/playlist?list=PL..." --max-videos 10 --lang hi --type audiobook
4. Check Checkpoint Statistics
View completion rates and failures per stage:
voice-pipeline show-stats
5. Reset Checkpoints
Force a clean re-run of a stage or the whole pipeline:
# Reset VAD stage only
voice-pipeline reset-checkpoints --stage vad
# Reset all stages
voice-pipeline reset-checkpoints
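A minimal sketch of how file-based, per-stage checkpoints could support skipping completed files, isolating failures, reporting statistics, and resetting a single stage. The `CheckpointStore` class, its JSON layout, and the `run_stage` helper are hypothetical; the package's real checkpoint format may differ.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


class CheckpointStore:
    """Hypothetical store: one JSON file mapping stage -> {file: status}."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.state: Dict[str, Dict[str, str]] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def _save(self) -> None:
        # Write to a temporary file and rename it over the target, so an
        # interrupted write does not leave a truncated checkpoint behind.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.state, indent=2))
        tmp.replace(self.path)

    def is_done(self, stage: str, item: str) -> bool:
        return self.state.get(stage, {}).get(item) == "done"

    def mark(self, stage: str, item: str, status: str) -> None:
        self.state.setdefault(stage, {})[item] = status
        self._save()

    def reset(self, stage: Optional[str] = None) -> None:
        # No stage given: clear everything. Otherwise drop only that stage.
        if stage is None:
            self.state.clear()
        else:
            self.state.pop(stage, None)
        self._save()

    def stats(self) -> Dict[str, Dict[str, int]]:
        out: Dict[str, Dict[str, int]] = {}
        for stage, items in self.state.items():
            counts = {"done": 0, "failed": 0}
            for status in items.values():
                counts[status] = counts.get(status, 0) + 1
            out[stage] = counts
        return out


def run_stage(store: CheckpointStore, stage: str, items: Iterable[str],
              process: Callable[[str], None]) -> None:
    for item in items:
        if store.is_done(stage, item):
            continue  # already completed in an earlier run
        try:
            process(item)
            store.mark(stage, item, "done")
        except Exception:
            # One bad file is recorded as failed and does not stop the batch.
            store.mark(stage, item, "failed")
```

In this design, failed items are retried on the next run because only the "done" status causes a skip.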
Configuration
Runtime overrides can be configured directly in voice_pipeline/configs/pipeline_config.yaml.
You can also override parameters using environment variables prefixed with VOICE_:
# Force target sample rate to 22.05kHz and run on CPU
export VOICE_AUDIO_TARGET_SAMPLE_RATE=22050
export VOICE_PIPELINE_DEVICE=cpu
voice-pipeline run-local data/audio.mp3
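One plausible way such overrides could be resolved is to map `VOICE_<SECTION>_<KEY>` onto a nested configuration dictionary. The mapping rule, the `apply_env_overrides` helper, and the default values used in the usage note are assumptions; the README only states that the variables are prefixed with VOICE_.

```python
import os
from typing import Any, Dict, Mapping, Optional


def _cast(raw: str, default: Any) -> Any:
    """Convert the environment string to the type of the default value."""
    if isinstance(default, bool):  # check bool first: bool is a subclass of int
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    if isinstance(default, int):
        return int(raw)
    if isinstance(default, float):
        return float(raw)
    return raw


def apply_env_overrides(config: Mapping[str, Mapping[str, Any]],
                        env: Optional[Mapping[str, str]] = None,
                        prefix: str = "VOICE_") -> Dict[str, Dict[str, Any]]:
    """Return a copy of a two-level config with matching environment values applied."""
    env = os.environ if env is None else env
    merged: Dict[str, Dict[str, Any]] = {}
    for section, values in config.items():
        merged[section] = dict(values)
        for key, default in values.items():
            # e.g. section "audio", key "target_sample_rate"
            #   -> VOICE_AUDIO_TARGET_SAMPLE_RATE
            raw = env.get(f"{prefix}{section}_{key}".upper())
            if raw is not None:
                merged[section][key] = _cast(raw, default)
    return merged
```

With hypothetical defaults of `{"audio": {"target_sample_rate": 16000}, "pipeline": {"device": "cuda"}}`, the two exports shown above would yield a sample rate of 22050 (as an integer) and the device "cpu", while the original dictionary stays unchanged.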
Quality Reporting & Outputs
At the end of a run, the pipeline exports:
- JSON Manifest: data/exports/segments.jsonl contains detailed records for every valid segment, including start/end times, speaker ID, emotion, style, transcripts, and SNR.
- Dataset CSV: data/exports/annotations.csv provides a flat tabular spreadsheet view of the dataset.
- Analytics Dashboard: data/reports/quality_report.json details pass rates, emotion, and speaker distributions.
- Plots: Distribution plots (emotion_distribution.png, speaker_distribution.png, rejection_distribution.png) are generated under data/reports/.
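To inspect the manifest outside the pipeline, something like the following could tally label distributions from a JSON Lines file. The field names `emotion` and `speaker_id` are guesses based on the record description above; check them against an actual segments.jsonl before relying on this.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Dict, Iterable, Iterator, Sequence, Union


def read_jsonl(path: Union[str, Path]) -> Iterator[Dict[str, Any]]:
    """Yield one record per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def distributions(records: Iterable[Dict[str, Any]],
                  fields: Sequence[str] = ("emotion", "speaker_id"),  # assumed field names
                  ) -> Dict[str, Counter]:
    """Count how often each value occurs for the given fields."""
    counts: Dict[str, Counter] = {field: Counter() for field in fields}
    for record in records:
        for field in fields:
            # Records lacking the field are grouped under "unknown".
            counts[field][record.get(field, "unknown")] += 1
    return counts
```

Calling `distributions(read_jsonl("data/exports/segments.jsonl"))` would return one `Counter` per field, which can be compared against the generated distribution plots.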