VANPY: Voice Analysis Framework (arXiv: 2502.17579)
This model combines the SpeechBrain ECAPA-TDNN speaker embedding model with an ANN regressor to predict a speaker's age from audio input. It takes ECAPA embeddings and Librosa acoustic features as input and was trained on both the VoxCeleb2 and TIMIT datasets.
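As a rough illustration of how the two feature sets can be joined into a single regressor input, the sketch below concatenates a 192-dimensional ECAPA embedding with a 31-dimensional acoustic feature vector to give 223 dimensions. The value 31 is an inference from the difference between the 223-dim and 192-dim feature sets listed for these models. The helper name `build_feature_vector`, the feature ordering, and the random placeholder values are all assumptions for illustration, not part of the released package.

```python
import random

ECAPA_DIM = 192    # ECAPA-TDNN speaker embedding size
LIBROSA_DIM = 31   # assumed: 223 total features minus 192 ECAPA dimensions


def build_feature_vector(ecapa_embedding, librosa_features):
    """Hypothetical helper: join embedding and acoustic features into one vector."""
    ecapa = [float(x) for x in ecapa_embedding]
    acoustic = [float(x) for x in librosa_features]
    if len(ecapa) != ECAPA_DIM:
        raise ValueError(f"expected {ECAPA_DIM} ECAPA values, got {len(ecapa)}")
    if len(acoustic) != LIBROSA_DIM:
        raise ValueError(f"expected {LIBROSA_DIM} acoustic values, got {len(acoustic)}")
    # Assumed ordering: embedding first, acoustic features second.
    return ecapa + acoustic


# Placeholder inputs; real values would come from SpeechBrain and Librosa.
random.seed(0)
fake_embedding = [random.gauss(0.0, 1.0) for _ in range(ECAPA_DIM)]
fake_acoustic = [random.gauss(0.0, 1.0) for _ in range(LIBROSA_DIM)]

features = build_feature_vector(fake_embedding, fake_acoustic)
print(len(features))  # 223
```

In practice the pipeline handles feature extraction internally, so this step is only needed when reproducing the input format by hand.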
We provide multiple pre-trained models with different architectures and feature sets. The table below compares their test performance:
| Model | Architecture | Features | Training Data | Test MAE | Best For |
|---|---|---|---|---|---|
| VoxCeleb2 SVR (223) | SVR | ECAPA + Librosa (223-dim) | VoxCeleb2 | 7.88 years | Best performance on VoxCeleb2 |
| VoxCeleb2 SVR (192) | SVR | ECAPA only (192-dim) | VoxCeleb2 | 7.89 years | Lightweight deployment |
| TIMIT ANN (192) | ANN | ECAPA only (192-dim) | TIMIT | 4.95 years | Clean studio recordings |
| Combined ANN (223) | ANN | ECAPA + Librosa (223-dim) | VoxCeleb2 + TIMIT | 6.93 years | Best general performance |
You may find other models here.
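For readers unfamiliar with the metric, the Test MAE column in the comparison above is a mean absolute error: the average absolute difference, in years, between predicted and true ages. A minimal sketch of that calculation follows; the ages are made-up illustrative values, not data from VoxCeleb2 or TIMIT.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference between true and predicted ages, in years."""
    if len(true_ages) != len(predicted_ages):
        raise ValueError("true_ages and predicted_ages must have the same length")
    if not true_ages:
        raise ValueError("at least one sample is required")
    total = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


# Illustrative values only (not taken from any dataset).
true_ages = [25.0, 40.0, 63.0, 31.0]
predicted_ages = [30.0, 36.0, 55.0, 34.0]

mae = mean_absolute_error(true_ages, predicted_ages)
print(f"MAE: {mae:.2f} years")  # MAE: 5.00 years
```

An MAE of 6.93 years therefore means that predictions on the test set were off by just under seven years on average.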
The model was trained on a combination of two datasets: VoxCeleb2 and TIMIT.
```bash
# Quote the URL so the shell does not try to expand the [full] extra
pip install "git+https://github.com/griko/voice-age-regression.git#egg=voice-age-regressor[full]"
```

```python
from age_regressor import AgeRegressionPipeline

# Load the pipeline
regressor = AgeRegressionPipeline.from_pretrained(
    "griko/age_reg_ann_ecapa_librosa_combined"
)

# Single file prediction
result = regressor("path/to/audio.wav")
print(f"Predicted age: {result[0]:.1f} years")

# Batch prediction
results = regressor(["audio1.wav", "audio2.wav"])
print(f"Predicted ages: {[f'{age:.1f}' for age in results]} years")
```
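When predicting over many files, it can help to keep each prediction paired with its source path. The sketch below wraps any callable that behaves like the batch call above, under the assumption that it returns one age per input path in input order. `predict_ages_by_file` and `stub_regressor` are hypothetical names introduced here for illustration; the stub stands in for the real pipeline so the example runs without downloading a model.

```python
def predict_ages_by_file(regressor, audio_paths):
    """Hypothetical helper: map each audio path to its predicted age."""
    paths = list(audio_paths)
    if not paths:
        return {}
    ages = regressor(paths)
    # Assumption: the regressor returns exactly one age per input path, in order.
    if len(ages) != len(paths):
        raise ValueError("regressor returned a different number of predictions than inputs")
    return {path: float(age) for path, age in zip(paths, ages)}


def stub_regressor(paths):
    """Stand-in for the real pipeline; returns fixed placeholder ages."""
    return [30.0 + i for i, _ in enumerate(paths)]


ages_by_file = predict_ages_by_file(stub_regressor, ["audio1.wav", "audio2.wav"])
for path, age in ages_by_file.items():
    print(f"{path}: {age:.1f} years")
```

To use it with the real model, pass the loaded `regressor` in place of `stub_regressor`.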
If you use this model in your research, please cite:
```bibtex
@misc{koushnir2025vanpyvoiceanalysisframework,
  title={VANPY: Voice Analysis Framework},
  author={Gregory Koushnir and Michael Fire and Galit Fuhrmann Alpert and Dima Kagan},
  year={2025},
  eprint={2502.17579},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2502.17579},
}
```