As part of a pet project I created this SAD[1] model. It takes a log-mel spectrogram as input and outputs a concatenated array of onsets and offsets.
- Loss: `BCEWithLogitsLoss`
- Optimizer: `Adam`
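For context, a minimal sketch of a training step with this loss/optimizer combination (the architecture below is a hypothetical stand-in, not the actual model from the repo; only `BCEWithLogitsLoss` and `Adam` match the setup described here):

```python
import torch
import torch.nn as nn

class TinySAD(nn.Module):
    """Hypothetical stand-in: per-frame onset/offset logits from a log-mel input."""
    def __init__(self, n_mels=64, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # two logits per frame: onset, offset

    def forward(self, x):                 # x: (batch, frames, n_mels)
        out, _ = self.rnn(x)
        return self.head(out)             # (batch, frames, 2)

model = TinySAD()
criterion = nn.BCEWithLogitsLoss()        # expects raw logits, applies sigmoid internally
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(4, 100, 64)                       # dummy log-mel batch
y = torch.randint(0, 2, (4, 100, 2)).float()      # dummy binary onset/offset targets

optimizer.zero_grad()
logits = model(x)
loss = criterion(logits, y)
loss.backward()
optimizer.step()
```

Note that `BCEWithLogitsLoss` takes raw logits, so no sigmoid is applied in the forward pass.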
Here are the metrics on the test set:
| Metric | Value |
|---|---|
| Accuracy | 0.99983 |
| Hamming loss | 0.00017 |
| Precision | 0.93272 |
| Recall | 0.93061 |
| F1 | 0.92964 |
| Loss | 0.00086 |
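For reference, these frame-level metrics can be computed from binary label sequences as follows (a dependency-free sketch; in practice `sklearn.metrics` offers the same functions):

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, Hamming loss, precision, recall, and F1 for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))

    accuracy = correct / len(y_true)
    hamming = 1 - accuracy  # for binary labels, Hamming loss = fraction of mismatches
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "hamming": hamming,
            "precision": precision, "recall": recall, "f1": f1}
```

With heavily imbalanced labels (speech boundaries are rare frames), accuracy and Hamming loss look near-perfect by construction, which is why precision/recall/F1 are the more informative numbers in the table above.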
To download the model and the necessary code, use the following snippet:

```python
from huggingface_hub import snapshot_download

snapshot_download("hypersunflower/a_sad_model", local_dir="model/", repo_type="model")
```
To use the model for inference[2]:

```python
# load the scripts (downloaded into model/ by the snippet above)
from model.speech_detection import detectSpeech
from model.sadModel import sadModel
from model.logMelSpectrogram import logMelSpectrogram

# load the model
detector = detectSpeech(
    model_path="model/a_sad_model.pth",
    model_class=sadModel(),
    logMelSpectrogram=logMelSpectrogram()
)

# inference
onset, offset = detector.detect("path_to_the_audio")
```
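Assuming the returned onsets and offsets are frame indices (the hop length below is an illustrative assumption, not a value from the repo), a hypothetical post-processing step to turn them into speech segments in seconds could look like:

```python
def to_segments(onsets, offsets, hop_s=0.010):
    """Pair sorted onset/offset frame indices into (start, end) times in seconds.

    Assumes one offset per onset; drops pairs where the offset does not
    follow its onset.
    """
    return [(on * hop_s, off * hop_s)
            for on, off in zip(sorted(onsets), sorted(offsets))
            if off > on]
```

The resulting `(start, end)` pairs can then be used to slice the audio, e.g. with `pydub.AudioSegment` (which takes millisecond indices).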
Note: the code uses `pydub.AudioSegment` to process the audio, which requires ffmpeg. On Linux (the leading `!` is for notebook environments) you can install it like this:

```shell
!apt update &> /dev/null
!apt install ffmpeg -y &> /dev/null
```

On other platforms, install ffmpeg through your system's package manager.
Training code can be found here: https://github.com/ertan-somundzhu/sad-model
[1] Short for Speech Activity Detection.
[2] Though the model shows good performance on the nccratliri/vad-human-ava-speech dataset (of which I used 25%), it will most likely fail on real-world noisy data.