Wren-TTS-360M (v1)
Wren is a series of small (<3B) multimodal speech LLMs covering TTS, ASR, and speech-language modelling. Wren-TTS-360M-v1 generates Kyutai Mimi neural-codec tokens from text using a HuggingFaceTB/SmolLM2-360M backbone, then decodes them to a 24 kHz waveform with the Mimi decoder.
An open research checkpoint: useful for experimentation, not production.
Architecture
text ───► SmolLM2-360M ───► k=8 parallel Mimi heads ───► Mimi decoder ───► 24 kHz
- Backbone: SmolLM2-360M (causal LM; text + audio share the same backbone)
- Audio tokenizer: Mimi (`kyutai/mimi`), 12.5 fps, 2048-entry codebooks
- Codebooks used: all 8 Mimi codebooks
- Layout: MusicGen-style delay pattern. At each step, k summed codebook embeddings go in and k parallel heads predict k tokens out. Codebook q at frame f lives at step s = f + q, so same-frame RVQ conditioning is preserved via the delay.
- Per-codebook input tables: `Embedding(2049, hidden)`; the extra row is `AUDIO_PAD` at sequence edges.
- Per-codebook output heads: `Linear(hidden, 2048)` for cb1..cb7; cb0 gets `Linear(hidden, 2049)` with the extra class being `AUDIO_EOS` (the stop token). See the module sketch after this list.
- Speaker conditioning (required): prepend `<|reference_start|> ref_codes <|reference_end|>` to the prompt; `ref_codes` is the Mimi encoding of a short reference clip. The model was trained multispeaker-only and expects a reference at inference; without one, output quality is poor.
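A minimal sketch of the per-codebook tables and heads described above (illustrative only; the hidden size of 960 and the variable names are assumptions, not the repo's actual module layout):

```python
import torch.nn as nn

HIDDEN = 960              # assumed SmolLM2-360M hidden size
CODEBOOK_SIZE = 2048      # Mimi codebook entries
K = 8                     # Mimi codebooks used

# Per-codebook input tables: one extra row reserved for AUDIO_PAD at sequence edges.
audio_embeddings = nn.ModuleList(
    nn.Embedding(CODEBOOK_SIZE + 1, HIDDEN) for _ in range(K)
)

# Per-codebook output heads: cb0 carries one extra class (AUDIO_EOS, the stop token);
# cb1..cb7 predict over the plain 2048-entry codebook.
audio_heads = nn.ModuleList(
    [nn.Linear(HIDDEN, CODEBOOK_SIZE + 1)]
    + [nn.Linear(HIDDEN, CODEBOOK_SIZE) for _ in range(K - 1)]
)
```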
Training data
Trained on four English speech corpora simultaneously:
- VCTK: 109 speakers across multiple English accents (~44k utterances)
- Jenny: single high-quality voice (~21k utterances)
- LibriTTS-R: `train-clean-{100,360}` + `train-other-500`, 960 h of multi-speaker English audiobooks (354k utterances)
- LJSpeech: single speaker, 24 h (13k utterances)
Single-pass training (no two-stage pretraining + fine-tune). Held-out validation:
LibriTTS-R dev_clean + 5% of each non-LibriTTS source (deterministic shuffle, seed 2027).
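The 5% slices are deterministic; a minimal sketch of how such a split can be reproduced with `datasets` (a dummy stand-in corpus here, and the training repo's actual split code may differ):

```python
from datasets import Dataset

# Stand-in for one non-LibriTTS corpus (VCTK / Jenny / LJSpeech); dummy rows for illustration.
corpus = Dataset.from_dict({"text": [f"utterance {i}" for i in range(1000)]})

# Deterministic 95/5 split with the seed quoted above.
split = corpus.train_test_split(test_size=0.05, shuffle=True, seed=2027)
train_ds, val_ds = split["train"], split["test"]
print(len(train_ds), len(val_ds))  # 950 50
```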
Text casing and punctuation are preserved. Pass text naturally; do not pre-lowercase.
Usage
```bash
pip install torch torchaudio transformers datasets
```
A reference audio clip is required. The model was trained multispeaker-only; without `ref_codes` it produces poor output. Any 3–12 s English speech clip works as the voice reference.
```python
import torch
import numpy as np
from datasets import load_dataset
from transformers import AutoModel, AutoProcessor

model_id = "shangeth/Wren-TTS-360M-v1"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).to(device).eval()

# Grab one LibriSpeech test-clean clip (~3.5 s, not in training) as the reference voice.
# Swap this block for `torchaudio.load("your_reference.wav")` to use your own clip.
sample = next(iter(load_dataset("openslr/librispeech_asr", "clean", split="test", streaming=True)))
ref_wav = torch.from_numpy(np.asarray(sample["audio"]["array"], dtype=np.float32)).unsqueeze(0)
ref_sr = sample["audio"]["sampling_rate"]
ref_codes = model.encode_audio(ref_wav, ref_sr)[:, :150]  # cap at ~12 s; encode_audio resamples to 24 kHz

# Tokenize the target text and generate speech in the reference voice
inputs = processor("Hello world, how are you today?")
inputs = {k: v.to(device) for k, v in inputs.items()}

waveform = model.generate(
    **inputs,
    ref_codes=ref_codes,
    max_audio_frames=200,
    min_audio_frames=2,
    temperature=0.8, top_k=50, top_p=0.9,
    output_audio=True,
)
processor.save_audio(waveform, "out.wav")
```
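To use your own voice instead of the LibriSpeech clip, swap the reference block for something like this (assumes a local WAV; the mono downmix is a precaution, since the example above feeds a single channel):

```python
import torchaudio

# Any 3–12 s English clip works as the voice reference.
ref_wav, ref_sr = torchaudio.load("your_reference.wav")   # (channels, samples)
ref_wav = ref_wav.mean(dim=0, keepdim=True)               # downmix to mono
ref_codes = model.encode_audio(ref_wav, ref_sr)[:, :150]  # cap at ~12 s (150 frames at 12.5 fps)
```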
Sampling tips
Defaults: `temperature=0.8`, `top_k=50`, `top_p=0.9`, `max_audio_frames=200` (~16 s).
If you hear the model generating extra speech past the intended text (hallucination), try the following (combined in the sketch after this list):
- Raise `eos_bias` (e.g. 2.0–6.0) to make the model more eager to stop
- Lower `temperature` (0.6) and `top_p` (0.8)
- Set `max_audio_frames` to roughly `12 * len(text_in_chars)`
- Set `min_audio_frames=1` for very short prompts
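Putting those knobs together, reusing `model`, `processor`, `device`, and `ref_codes` from the Usage section (`eos_bias=3.0` is just a mid-range starting point, not a tuned value):

```python
text = "A very short prompt."
inputs = {k: v.to(device) for k, v in processor(text).items()}

waveform = model.generate(
    **inputs,
    ref_codes=ref_codes,
    eos_bias=3.0,                          # nudge the stop token (2.0–6.0 range above)
    temperature=0.6, top_k=50, top_p=0.8,  # tighter sampling
    max_audio_frames=12 * len(text),       # rule-of-thumb cap from above
    min_audio_frames=1,                    # allow very short outputs
    output_audio=True,
)
processor.save_audio(waveform, "short.wav")
```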
Why delay pattern
Mimi uses residual vector quantization (RVQ): cb0 is semantic, cb1..cb7 encode successive residuals. cb_q is only meaningful given cb0..cb_{q-1}, so same-frame conditioning matters.
A flat interleaved layout (cb0_f0, cb1_f0, ..., cb0_f1, ...) preserves that
conditioning best but balloons sequence length by k× and forces k autoregressive
LLM calls per frame. The delay pattern keeps RVQ conditioning (cb_q at frame f is
predicted from a hidden state that has already attended over cb0..cb_{q-1} of the
same frame) while cutting sequence length to T + k - 1 and LLM calls to one per
step, enabling all 8 Mimi codebooks without blowing up context.
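A self-contained sketch of that layout (pure illustration; the padding id and helper names are assumptions, not the repo's code):

```python
import torch

K = 8              # Mimi codebooks
AUDIO_PAD = 2048   # assumed id of the extra padding row

def to_delay_pattern(codes: torch.Tensor) -> torch.Tensor:
    """(K, T) frame-aligned Mimi codes -> (K, T + K - 1) delayed layout.

    Codebook q of frame f lands at step s = f + q, so by the time the backbone
    predicts it, cb0..cb_{q-1} of the same frame are already in context.
    """
    k, t = codes.shape
    out = torch.full((k, t + k - 1), AUDIO_PAD, dtype=codes.dtype)
    for q in range(k):
        out[q, q:q + t] = codes[q]
    return out

def from_delay_pattern(delayed: torch.Tensor) -> torch.Tensor:
    """Undo the per-codebook shift to recover (K, T) frame-aligned codes."""
    k, s = delayed.shape
    t = s - k + 1
    return torch.stack([delayed[q, q:q + t] for q in range(k)])

codes = torch.randint(0, 2048, (K, 5))
assert torch.equal(from_delay_pattern(to_delay_pattern(codes)), codes)
```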
Limitations & known issues
- Hallucinated continuations: occasionally generates plausible speech past the input text. Mitigate with `eos_bias` at inference.
- English only.
- Audiobook-style prosody inherited from LibriTTS-R; not as expressive as modern conversational TTS.
- Small backbone (360M): quality is below frontier TTS systems.
- cb0 begins to overfit earlier than cb3–cb7; the released checkpoint is the best-epoch point (by overall val loss) from the full training run.
The Wren series
Wren is a family of compact (<3B parameter) multimodal speech LLMs, small enough to run on a single consumer GPU and designed for open research on unified speech understanding and synthesis. Planned siblings:
- Wren-TTS: text → speech (this release)
- Wren-ASR: speech → text
- Wren-LM: speech-language modelling / dialog
- Wren-Omni: unified ASR + TTS + LM in one checkpoint
All Wren models share the same design principles: small backbone LLM + neural audio codec, open weights, simple PyTorch checkpoints, reproducible training recipes.
Repository contents
| File | Purpose |
|---|---|
| `model.safetensors` | Model weights |
| `config.json` | `WrenConfig` (with `auto_map` for `trust_remote_code`) |
| `tokenizer.json` + friends | SmolLM2 tokenizer with Wren's 3 special tokens added |
| `processor_config.json` | `WrenProcessor` `auto_map` |
| `configuration_wren.py` | `WrenConfig(PretrainedConfig)` |
| `modeling_wren.py` | `WrenForTTS(PreTrainedModel)`; loads the Mimi codec lazily on first generate |
| `processing_wren.py` | `WrenProcessor(ProcessorMixin)`; tokenize + `save_audio` |
| `README.md` | This model card |
Citation
```bibtex
@misc{wren2026,
  title  = {Wren: A Family of Small Open-Weight Models for Unified Speech-Text Modelling},
  author = {Shangeth Rajaa},
  year   = {2026},
  url    = {https://github.com/shangeth/wren}
}

@inproceedings{koizumi2023libritts,
  title     = {LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus},
  author    = {Koizumi, Yuma and Zen, Heiga and Karita, Shigeki and Ding, Yifan
               and Yatabe, Kohei and Morioka, Nobuyuki and Bacchiani, Michiel and
               Zhang, Yu and Han, Wei and Bapna, Ankur},
  booktitle = {Interspeech},
  year      = {2023}
}
```
License
Apache-2.0 for the checkpoint weights and code in this repo. Upstream components carry their own licenses; review before redistribution.