Tiny Audio ASR - LoquaciousSet Training

A speech-to-text model trained with the Tiny Audio framework. It combines a frozen Whisper encoder, a trained MLP projector, and a frozen SmolLM3-3B decoder.

Model Description

This model uses an encoder-projector-decoder architecture for automatic speech recognition:

Component        Model                           Parameters   Training Status
Audio Encoder    openai/whisper-large-v3-turbo   ~800M        Frozen
Projector        MLP                             11.7M        Trained
Language Model   HuggingFaceTB/SmolLM3-3B        3B           Frozen
Total            -                               3.72B        0.32% trainable
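The trainable share in the last row can be checked with a few lines of arithmetic. The sketch below uses only the rounded figures from the table (11.7M trainable, 3.72B total); the function name is illustrative and not part of the Tiny Audio code. With these rounded inputs the result comes out at about 0.31%, in line with the reported 0.32%; the small gap presumably comes from rounding in the published counts.

```python
def trainable_percent(trainable_params: float, total_params: float) -> float:
    """Return the share of trainable parameters as a percentage."""
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    return 100.0 * trainable_params / total_params


# Rounded figures taken from the component table above.
PROJECTOR_PARAMS = 11.7e6   # trained MLP projector
TOTAL_PARAMS = 3.72e9       # encoder + projector + language model

if __name__ == "__main__":
    pct = trainable_percent(PROJECTOR_PARAMS, TOTAL_PARAMS)
    print(f"Trainable share: {pct:.2f}%")
```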

Training Details

Infrastructure

  • GPU: NVIDIA H100 80GB HBM3
  • Cloud Provider: E2E Networks
  • Framework: PyTorch 2.8.0, Transformers 4.57.3

Hyperparameters

  • Dataset: speechbrain/LoquaciousSet (small subset)
  • Train Samples: 1,000
  • Evaluation Samples: 100
  • Batch Size: 8
  • Learning Rate: 3e-4
  • Max Steps: 500
  • Warmup Steps: 50
  • Precision: BF16
  • Gradient Checkpointing: Enabled
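To put these settings in context, the sketch below works out how many samples the run sees and what the learning rate looks like during warmup. It assumes a single GPU with no gradient accumulation and a linear warmup from zero; neither is stated in this card, and the schedule after warmup is not specified, so the function simply holds the peak rate. All function names are illustrative.

```python
TRAIN_SAMPLES = 1_000
BATCH_SIZE = 8
MAX_STEPS = 500
WARMUP_STEPS = 50
PEAK_LR = 3e-4


def samples_seen(steps: int, batch_size: int, grad_accum: int = 1) -> int:
    """Samples processed after `steps` optimizer steps (grad_accum=1 is an assumption)."""
    return steps * batch_size * grad_accum


def approx_epochs(steps: int, batch_size: int, dataset_size: int) -> float:
    """Approximate number of passes over the training set."""
    return samples_seen(steps, batch_size) / dataset_size


def warmup_lr(step: int, peak_lr: float = PEAK_LR, warmup_steps: int = WARMUP_STEPS) -> float:
    """Linear warmup to peak_lr (assumed shape); holds peak_lr afterwards."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr


if __name__ == "__main__":
    print("samples seen:", samples_seen(MAX_STEPS, BATCH_SIZE))
    print("approx. epochs:", approx_epochs(MAX_STEPS, BATCH_SIZE, TRAIN_SAMPLES))
    print("lr at step 25:", warmup_lr(25))
```

Under these assumptions the run sees 4,000 samples, roughly four passes over the 1,000-sample training subset.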

Training Metrics

Step   Training Loss   Validation Loss
100    3.078           3.165
200    2.543           3.163
300    0.500           0.813
400    0.140           0.728
500    0.101           0.764

Training time: ~18 minutes on H100.
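One way to read the logged losses is to pick the step with the lowest validation loss and to track the gap between validation and training loss. The sketch below does this with the five logged rows; the helper names are illustrative and not part of the Tiny Audio framework.

```python
# (step, training loss, validation loss) as logged above.
METRICS = [
    (100, 3.078, 3.165),
    (200, 2.543, 3.163),
    (300, 0.500, 0.813),
    (400, 0.140, 0.728),
    (500, 0.101, 0.764),
]


def best_checkpoint(metrics):
    """Return the row with the lowest validation loss."""
    return min(metrics, key=lambda row: row[2])


def generalization_gaps(metrics):
    """Map each step to (validation loss - training loss), rounded to 3 decimals."""
    return {step: round(val - train, 3) for step, train, val in metrics}


if __name__ == "__main__":
    step, train_loss, val_loss = best_checkpoint(METRICS)
    print(f"lowest validation loss {val_loss} at step {step}")
    for s, gap in generalization_gaps(METRICS).items():
        print(f"step {s}: gap {gap}")
```

In these logs, validation loss is lowest at step 400 and rises slightly by step 500 while training loss keeps falling, which is worth watching for on a 1,000-sample subset.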

Usage

import torch
import torchaudio

from src.asr_config import ASRConfig
from src.asr_modeling import ASRModel

# Initialize model
config = ASRConfig(
    audio_model_id="openai/whisper-large-v3-turbo",
    text_model_id="HuggingFaceTB/SmolLM3-3B",
    projector_type="mlp",
    attn_implementation="sdpa",
)
model = ASRModel(config)

# Load audio
waveform, sample_rate = torchaudio.load("audio.wav")
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
# torchaudio returns (channels, time); average channels to get mono audio
audio_array = waveform.mean(dim=0).numpy()

# Transcribe
inputs = model.feature_extractor(
    audio_array, sampling_rate=16000, return_tensors="pt"
).input_features.to(model.device).to(model.dtype)

with torch.no_grad():
    output = model.generate(input_features=inputs, max_new_tokens=256)

transcription = model.tokenizer.decode(output[0], skip_special_tokens=True)
print(transcription)

Example Results

Input Audio: Sample from LoquaciousSet evaluation set

Ground Truth:

THESE ARE REFORMS THAT WILL DISCIPLINE AND CONSTRAIN THE EXERCISE OF POWER 
BY THE GOVERNMENT AND ANY OTHER ECONOMIC OR POLITICAL ACTOR FOR GENERATIONS TO COME

Model Output:

These are reforms that will discipline and constrain the exercise of power 
by the government and any other economic or political actor for generations to come
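The two transcripts differ only in letter case, so a case-insensitive word error rate (WER) for this sample is zero. This card does not report WER; the sketch below is a minimal, dependency-free way to compute it, assuming a simple normalizer that lowercases and splits on whitespace. The function names are illustrative.

```python
def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple normalizer)."""
    return text.lower().split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


GROUND_TRUTH = (
    "THESE ARE REFORMS THAT WILL DISCIPLINE AND CONSTRAIN THE EXERCISE OF POWER "
    "BY THE GOVERNMENT AND ANY OTHER ECONOMIC OR POLITICAL ACTOR FOR GENERATIONS TO COME"
)
MODEL_OUTPUT = (
    "These are reforms that will discipline and constrain the exercise of power "
    "by the government and any other economic or political actor for generations to come"
)

if __name__ == "__main__":
    print("WER:", word_error_rate(GROUND_TRUTH, MODEL_OUTPUT))
```

A single matching sample says little about overall accuracy; a meaningful figure would need WER averaged over the full evaluation set.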

Limitations

  • Trained on a small subset (1,000 samples) for demonstration purposes
  • Full training of 50,000+ steps is recommended for production use
  • English language only
  • Optimized for clean speech; performance may degrade on noisy audio

Citation

Tiny Audio Framework

@software{kroman2025tinyaudio,
  author = {Kroman, Alex},
  title = {Tiny Audio: Train Your Own Speech Recognition Model in 24 Hours},
  year = {2025},
  url = {https://github.com/alexkroman/tiny-audio}
}

LoquaciousSet Dataset

@misc{speechbrain2024loquaciousset,
  author = {{SpeechBrain Team}},
  title = {LoquaciousSet: 25,000 Hours of Transcribed English Speech},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/speechbrain/LoquaciousSet}
}

Whisper

@article{radford2022whisper,
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year = {2022}
}

SmolLM

@misc{smollm2024,
  author = {{Hugging Face}},
  title = {SmolLM: Smaller Language Models for Efficient Inference},
  year = {2024},
  url = {https://huggingface.co/HuggingFaceTB/SmolLM3-3B}
}

License

Apache 2.0 - See the Tiny Audio repository for details.
