Voice Activity Detection

As part of a pet project, I created this SAD[1] model. It takes a log mel-spectrogram as input and outputs a concatenated array of speech onset and offset predictions.
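
To make the interface concrete, here is a minimal sketch of the assumed tensor shapes; the mel-bin count, frame count, and exact output layout are illustrative guesses, not the repository's actual code:

import torch

# Assumed input: a batch of log mel-spectrograms, (batch, n_mels, frames).
n_mels, T = 64, 400
spec = torch.randn(1, n_mels, T)

# Assumed output: one onset and one offset logit per frame,
# concatenated along the last dimension -> (batch, 2 * T).
logits = torch.randn(1, 2 * T)
onset_logits, offset_logits = logits.split(T, dim=-1)

# probabilities come from a sigmoid over the raw logits
onset_prob = torch.sigmoid(onset_logits)
offset_prob = torch.sigmoid(offset_logits)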

Loss: BCEWithLogitsLoss
Optimizer: Adam
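
For reference, a single PyTorch training step with this loss/optimizer pairing might look like the sketch below; the linear model, learning rate, and dummy batch are placeholders, not the actual training code (see the GitHub link further down):

import torch
from torch import nn, optim

model = nn.Linear(64, 2)                             # placeholder for the SAD model
criterion = nn.BCEWithLogitsLoss()                   # the loss named above
optimizer = optim.Adam(model.parameters(), lr=1e-3)  # lr is an assumption

spec = torch.randn(8, 64)                            # dummy batch of features
targets = torch.randint(0, 2, (8, 2)).float()        # dummy onset/offset labels

optimizer.zero_grad()
logits = model(spec)
loss = criterion(logits, targets)  # BCE is applied to raw logits
loss.backward()
optimizer.step()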

Here are the metrics on the test set:

Metric     Value
Accuracy   0.9998331655911613
Hamming    0.00016682081819592185
Precision  0.9327198181417427
Recall     0.9306135245038709
F1         0.9296357635399213
Loss       0.0008604296028513627
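
These are standard binary classification metrics. As an illustration (not the project's actual evaluation code), they can be computed with scikit-learn on binarized frame predictions; the 0.5 threshold and the dummy arrays are assumptions:

import numpy as np
from sklearn.metrics import (accuracy_score, hamming_loss,
                             precision_score, recall_score, f1_score)

# dummy per-frame labels and predicted probabilities
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.9, 0.6, 0.2, 0.4])
y_pred = (y_prob >= 0.5).astype(int)  # threshold at 0.5 (assumed)

print(accuracy_score(y_true, y_pred))
print(hamming_loss(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))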

To download the model and the necessary code, use the following snippet:

from huggingface_hub import snapshot_download
snapshot_download("hypersunflower/a_sad_model", local_dir="model/", repo_type="model")
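
After downloading, you can verify that the weights and helper scripts landed in model/; the expected file names are an assumption based on the imports used below:

import os

# list the downloaded snapshot; expect a_sad_model.pth plus the
# speech_detection / sadModel / logMelSpectrogram helper scripts
print(os.listdir("model/"))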

To use the model for inference[2]:

# load the scripts (downloaded into model/ by the snippet above)
from model.speech_detection import detectSpeech
from model.sadModel import sadModel
from model.logMelSpectrogram import logMelSpectrogram

# load the models
detector = detectSpeech(
    model_path="model/a_sad_model.pth",
    model_class=sadModel(),
    logMelSpectrogram=logMelSpectrogram()
)

# inference
onset, offset = detector.detect("path_to_the_audio")
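
Assuming detect returns two parallel arrays of segment start and end times in seconds (an assumption about the return format), they can be zipped into speech segments:

# pair each onset with its corresponding offset
for start, end in zip(onset, offset):
    print(f"speech from {start:.2f}s to {end:.2f}s")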

Note: the code uses pydub.AudioSegment to process the audio, which requires FFmpeg. You can install it as follows:

!apt update &> /dev/null
!apt install ffmpeg -y &> /dev/null

This works on Debian-based Linux in a notebook environment (the leading ! runs a shell command); on other systems, install FFmpeg with your package manager.
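
Whatever the platform, you can check from Python that pydub will be able to use FFmpeg, since it looks the binary up on PATH:

import shutil

# returns the path to the ffmpeg binary, or None if it is not installed
print(shutil.which("ffmpeg"))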

Training code can be found here: https://github.com/ertan-somundzhu/sad-model

[1] Short for Speech Activity Detection.

[2] Though the model shows good performance on the nccratliri/vad-human-ava-speech dataset (from which I took 25% of the original data), it will most likely fail on real-world noisy data.
