A speech-to-text model trained with the Tiny Audio framework. It combines a frozen Whisper encoder, a trained MLP projector, and a frozen SmolLM3-3B decoder.
This model uses an encoder-projector-decoder architecture for automatic speech recognition:
| Component | Model | Parameters | Training Status |
|---|---|---|---|
| Audio Encoder | openai/whisper-large-v3-turbo | ~800M | Frozen |
| Projector | MLP | 11.7M | Trained |
| Language Model | HuggingFaceTB/SmolLM3-3B | 3B | Frozen |
| Total | - | 3.72B | 0.32% trainable |
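
The trainable share in the last row follows from the parameter counts. A quick check with the rounded figures from the table gives roughly 0.31%, which agrees with the reported 0.32% up to rounding of the inputs. The variable names are illustrative only:

```python
# Rounded parameter counts taken from the table above.
projector_params = 11.7e6   # trained MLP projector
total_params = 3.72e9       # encoder + projector + language model

trainable_pct = 100 * projector_params / total_params
print(f"Trainable share: {trainable_pct:.2f}%")  # prints "Trainable share: 0.31%"
```
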

Training and validation loss by step:

| Step | Training Loss | Validation Loss |
|---|---|---|
| 100 | 3.078 | 3.165 |
| 200 | 2.543 | 3.163 |
| 300 | 0.500 | 0.813 |
| 400 | 0.140 | 0.728 |
| 500 | 0.101 | 0.764 |
Training time: ~18 minutes on an H100 GPU.
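
The validation loss in the table bottoms out at step 400 and ticks up at step 500 while the training loss keeps falling. One common way to act on this, sketched here as an assumption rather than a description of what Tiny Audio does, is to keep the checkpoint with the lowest validation loss:

```python
# (step, training loss, validation loss) rows copied from the table above.
history = [
    (100, 3.078, 3.165),
    (200, 2.543, 3.163),
    (300, 0.500, 0.813),
    (400, 0.140, 0.728),
    (500, 0.101, 0.764),
]

# Select the row with the smallest validation loss.
best_step, best_train, best_val = min(history, key=lambda row: row[2])
print(f"best step: {best_step}, validation loss: {best_val}")

# Gap between validation and training loss at each step.
gaps = {step: round(val - train, 3) for step, train, val in history}
print(gaps)
```

A widening gap between the two losses, as seen from step 300 onward, is the usual signal to stop training or to evaluate earlier checkpoints.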

```python
import torch
import torchaudio

from src.asr_config import ASRConfig
from src.asr_modeling import ASRModel

# Initialize the model from the encoder and decoder checkpoints
config = ASRConfig(
    audio_model_id="openai/whisper-large-v3-turbo",
    text_model_id="HuggingFaceTB/SmolLM3-3B",
    projector_type="mlp",
    attn_implementation="sdpa",
)
model = ASRModel(config)

# Load audio, downmix to mono, and resample to the 16 kHz the encoder expects
waveform, sample_rate = torchaudio.load("audio.wav")
waveform = waveform.mean(dim=0)  # (channels, samples) -> (samples,)
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
audio_array = waveform.numpy()

# Extract features and transcribe
inputs = model.feature_extractor(
    audio_array, sampling_rate=16000, return_tensors="pt"
).input_features.to(model.device).to(model.dtype)
with torch.no_grad():
    output = model.generate(input_features=inputs, max_new_tokens=256)
transcription = model.tokenizer.decode(output[0], skip_special_tokens=True)
print(transcription)
```
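
The snippet above passes the whole recording to the feature extractor in one call. Assuming `model.feature_extractor` is the standard Whisper feature extractor, which pads or truncates its input to 30 seconds, longer recordings would need to be split first. A minimal sketch of the index arithmetic follows; `chunk_bounds` is a hypothetical helper, not part of Tiny Audio:

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_seconds=30):
    """Return (start, end) sample indices covering the audio in fixed-size chunks."""
    chunk_size = sample_rate * chunk_seconds
    return [
        (start, min(start + chunk_size, num_samples))
        for start in range(0, num_samples, chunk_size)
    ]

# A 62.5-second recording at 16 kHz has 1,000,000 samples.
bounds = chunk_bounds(1_000_000)
print(bounds)  # [(0, 480000), (480000, 960000), (960000, 1000000)]
```

Each `audio_array[start:end]` slice could then be transcribed separately and the texts joined. Cutting at fixed offsets can split words, so overlapping windows or silence-based cuts are often preferred in practice.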
Input Audio: Sample from LoquaciousSet evaluation set
Ground Truth:
THESE ARE REFORMS THAT WILL DISCIPLINE AND CONSTRAIN THE EXERCISE OF POWER
BY THE GOVERNMENT AND ANY OTHER ECONOMIC OR POLITICAL ACTOR FOR GENERATIONS TO COME
Model Output:
These are reforms that will discipline and constrain the exercise of power
by the government and any other economic or political actor for generations to come
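
The reference is upper case and the output is sentence case, so a comparison needs text normalization first. As a rough sketch (the `wer` helper is illustrative, not the evaluation code used by Tiny Audio), word error rate after lowercasing can be computed with a word-level edit distance:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

ground_truth = (
    "THESE ARE REFORMS THAT WILL DISCIPLINE AND CONSTRAIN THE EXERCISE OF POWER "
    "BY THE GOVERNMENT AND ANY OTHER ECONOMIC OR POLITICAL ACTOR FOR GENERATIONS TO COME"
)
model_output = (
    "These are reforms that will discipline and constrain the exercise of power "
    "by the government and any other economic or political actor for generations to come"
)
print(wer(ground_truth, model_output))  # 0.0
```

The helper assumes a non-empty reference. A fuller evaluation would also strip punctuation and normalize numerals before scoring.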

```bibtex
@software{kroman2025tinyaudio,
  author = {Kroman, Alex},
  title = {Tiny Audio: Train Your Own Speech Recognition Model in 24 Hours},
  year = {2025},
  url = {https://github.com/alexkroman/tiny-audio}
}

@misc{speechbrain2024loquaciousset,
  author = {{SpeechBrain Team}},
  title = {LoquaciousSet: 25,000 Hours of Transcribed English Speech},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/speechbrain/LoquaciousSet}
}

@article{radford2022whisper,
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year = {2022}
}

@misc{smollm2024,
  author = {{Hugging Face}},
  title = {SmolLM: Smaller Language Models for Efficient Inference},
  year = {2024},
  url = {https://huggingface.co/HuggingFaceTB/SmolLM3-3B}
}
```

Apache 2.0 - See the Tiny Audio repository for details.
Base model: HuggingFaceTB/SmolLM3-3B-Base