Tiny Audio ASR - LoquaciousSet Training

A speech-to-text model trained with the Tiny Audio framework. It combines a frozen Whisper encoder, a trained MLP projector, and a frozen SmolLM3-3B decoder.

Model Description

This model uses an encoder-projector-decoder architecture for automatic speech recognition:

Component	Model	Parameters	Training Status
Audio Encoder	openai/whisper-large-v3-turbo	~800M	Frozen
Projector	MLP	11.7M	Trained
Language Model	HuggingFaceTB/SmolLM3-3B	3B	Frozen
Total	-	3.72B	0.32% trainable
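
As a quick sanity check on the table, the trainable share can be recomputed from the rounded figures above. This is only arithmetic on the published numbers, not output from the training code, and the variable names are illustrative.

```python
# Figures copied from the component table (rounded values).
projector_params = 11.7e6   # trained MLP projector
total_params = 3.72e9       # encoder + projector + language model

# Share of parameters that receive gradient updates, as a percentage.
trainable_share = projector_params / total_params * 100
print(f"Trainable share: {trainable_share:.2f}%")
```

With these rounded inputs the result is about 0.31%, close to the 0.32% listed in the table. The small gap is consistent with rounding in the published parameter counts.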

Training Details

Infrastructure

GPU: NVIDIA H100 80GB HBM3
Cloud Provider: E2E Networks
Framework: PyTorch 2.8.0, Transformers 4.57.3

Hyperparameters

Dataset: speechbrain/LoquaciousSet (small subset)
Train Samples: 1,000
Evaluation Samples: 100
Batch Size: 8
Learning Rate: 3e-4
Max Steps: 500
Warmup Steps: 50
Precision: BF16
Gradient Checkpointing: Enabled
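
The hyperparameters above imply how many passes the run makes over the training subset. The sketch below works this out, assuming each optimizer step consumes exactly one batch of 8 samples (no gradient accumulation, single device). That assumption is not stated in this card.

```python
# Values copied from the hyperparameter list.
train_samples = 1_000
batch_size = 8
max_steps = 500
warmup_steps = 50

# Assumption: one optimizer step == one batch of `batch_size` samples.
samples_seen = batch_size * max_steps
epochs = samples_seen / train_samples
warmup_fraction = warmup_steps / max_steps

print(f"Samples seen: {samples_seen}")
print(f"Approximate epochs: {epochs:.1f}")
print(f"Warmup fraction: {warmup_fraction:.0%}")
```

Under that assumption the run sees 4,000 samples, about four passes over the 1,000-sample subset, with warmup covering the first 10% of steps.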

Training Metrics

Step	Training Loss	Validation Loss
100	3.078	3.165
200	2.543	3.163
300	0.500	0.813
400	0.140	0.728
500	0.101	0.764

Training time: ~18 minutes on H100.
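
The logged metrics can be inspected with a few lines of Python, for example to pick the checkpoint with the lowest validation loss. The values are copied from the table above. The throughput figure relies on the approximate 18-minute runtime, so treat it as a rough estimate.

```python
# (step, training loss, validation loss) rows from the metrics table.
metrics = [
    (100, 3.078, 3.165),
    (200, 2.543, 3.163),
    (300, 0.500, 0.813),
    (400, 0.140, 0.728),
    (500, 0.101, 0.764),
]

# Checkpoint with the lowest validation loss.
best_step, _, best_val = min(metrics, key=lambda row: row[2])

# Gap between validation and training loss at each logged step.
gaps = {step: round(val - train, 3) for step, train, val in metrics}

# Rough throughput from the approximate total runtime.
steps_per_minute = 500 / 18

print(f"Best validation loss {best_val} at step {best_step}")
print(f"Validation - training gap by step: {gaps}")
print(f"Approximate throughput: {steps_per_minute:.1f} steps/min")
```

Validation loss is lowest at step 400 and rises slightly by step 500 while training loss keeps falling. On a 1,000-sample subset this may indicate the start of overfitting.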

Usage

import torch
import torchaudio

from src.asr_config import ASRConfig
from src.asr_modeling import ASRModel

# Initialize model
config = ASRConfig(
    audio_model_id="openai/whisper-large-v3-turbo",
    text_model_id="HuggingFaceTB/SmolLM3-3B",
    projector_type="mlp",
    attn_implementation="sdpa",
)
model = ASRModel(config)

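# Load audio, downmix to mono, and resample to 16 kHz
waveform, sample_rate = torchaudio.load("audio.wav")
waveform = waveform.mean(dim=0, keepdim=True)  # (channels, samples) -> (1, samples)
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
audio_array = waveform.squeeze(0).numpy()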

# Transcribe
inputs = model.feature_extractor(
    audio_array, sampling_rate=16000, return_tensors="pt"
).input_features.to(model.device).to(model.dtype)

with torch.no_grad():
    output = model.generate(input_features=inputs, max_new_tokens=256)

transcription = model.tokenizer.decode(output[0], skip_special_tokens=True)
print(transcription)

Example Results

Input Audio: Sample from LoquaciousSet evaluation set

Ground Truth:

THESE ARE REFORMS THAT WILL DISCIPLINE AND CONSTRAIN THE EXERCISE OF POWER 
BY THE GOVERNMENT AND ANY OTHER ECONOMIC OR POLITICAL ACTOR FOR GENERATIONS TO COME

Model Output:

These are reforms that will discipline and constrain the exercise of power 
by the government and any other economic or political actor for generations to come
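
The reference and the model output above differ only in letter case. A minimal sketch of a case-insensitive word error rate (WER) check follows. The `word_error_rate` helper is written for this illustration and is not part of the Tiny Audio framework. It applies only lowercasing and whitespace tokenization, which is simpler than the text normalization used in typical ASR evaluations.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


ground_truth = (
    "THESE ARE REFORMS THAT WILL DISCIPLINE AND CONSTRAIN THE EXERCISE OF POWER "
    "BY THE GOVERNMENT AND ANY OTHER ECONOMIC OR POLITICAL ACTOR FOR GENERATIONS TO COME"
)
model_output = (
    "These are reforms that will discipline and constrain the exercise of power "
    "by the government and any other economic or political actor for generations to come"
)

wer = word_error_rate(ground_truth, model_output)
print(f"WER: {wer:.3f}")
```

For this single sample the WER is 0.0 after lowercasing. One example says little about accuracy across the evaluation set.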

Limitations

Trained on a small subset (1,000 samples) for demonstration purposes
Full training with 50,000+ steps recommended for production use
English language only
Optimized for clean speech; performance may degrade on noisy audio

Citation

Tiny Audio Framework

@software{kroman2025tinyaudio,
  author = {Kroman, Alex},
  title = {Tiny Audio: Train Your Own Speech Recognition Model in 24 Hours},
  year = {2025},
  url = {https://github.com/alexkroman/tiny-audio}
}

LoquaciousSet Dataset

@misc{speechbrain2024loquaciousset,
  author = {{SpeechBrain Team}},
  title = {LoquaciousSet: 25,000 Hours of Transcribed English Speech},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/speechbrain/LoquaciousSet}
}

Whisper

@article{radford2022whisper,
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year = {2022}
}

SmolLM

@misc{smollm2024,
  author = {{Hugging Face}},
  title = {SmolLM: Smaller Language Models for Efficient Inference},
  year = {2024},
  url = {https://huggingface.co/HuggingFaceTB/SmolLM3-3B}
}

License

Apache 2.0 - See the Tiny Audio repository for details.

Acknowledgments

Alex Kroman for the Tiny Audio framework
SpeechBrain for the LoquaciousSet dataset
OpenAI for Whisper
Hugging Face for SmolLM3 and infrastructure
E2E Networks for GPU cloud infrastructure
