Wren-TTS-360M (v1)

Wren is a series of small (<3B) multimodal speech LLMs covering TTS, ASR, and speech-language modelling. Wren-TTS-360M-v1 generates Kyutai Mimi neural-codec tokens from text using a HuggingFaceTB/SmolLM2-360M backbone, then decodes to 24 kHz waveform with the Mimi decoder.

An open research checkpoint: useful for experimentation, not production.

Architecture

text ──► SmolLM2-360M ──► k=8 parallel Mimi heads ──► Mimi decoder ──► 24 kHz
  • Backbone: SmolLM2-360M (causal LM; text + audio share the same backbone)
  • Audio tokenizer: Mimi (kyutai/mimi), 12.5 fps, 2048-entry codebooks
  • Codebooks used: all 8 Mimi codebooks
  • Layout: MusicGen-style delay pattern. At each step, k summed codebook embeddings go in and k parallel heads predict k tokens out. Codebook q at frame f lives at step s = f + q, so same-frame RVQ conditioning is preserved via the delay (see the sketch after this list).
  • Per-codebook input tables: Embedding(2049, hidden); the extra row is AUDIO_PAD at sequence edges.
  • Per-codebook output heads: Linear(hidden, 2048) for cb1..cb7. cb0 gets Linear(hidden, 2049) with the extra class = AUDIO_EOS (stop token).
  • Speaker conditioning (required): prepend <|reference_start|> ref_codes <|reference_end|> to the prompt; ref_codes is the Mimi encoding of a short reference clip. The model was trained multispeaker-only and expects a reference at inference; without one, output quality is poor.
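
A minimal sketch of this per-step wiring. The hidden size and module names below are illustrative assumptions, not the checkpoint's actual internals:

import torch
import torch.nn as nn

K, VOCAB, HIDDEN = 8, 2048, 960          # 8 Mimi codebooks; hidden size assumed for illustration

# Per-codebook input tables (extra row = AUDIO_PAD); cb0's head has an extra AUDIO_EOS class.
embeds = nn.ModuleList(nn.Embedding(VOCAB + 1, HIDDEN) for _ in range(K))
heads  = nn.ModuleList(nn.Linear(HIDDEN, VOCAB + 1 if q == 0 else VOCAB) for q in range(K))

step_tokens = torch.randint(0, VOCAB, (1, K))               # the k delayed tokens for one step
x = sum(embeds[q](step_tokens[:, q]) for q in range(K))     # summed embeddings -> one backbone input
h = x                                                       # stand-in for the SmolLM2 hidden state
logits = [heads[q](h) for q in range(K)]                    # k parallel next-token predictions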

Training data

Trained on four English speech corpora simultaneously:

  • VCTK: 109 speakers across multiple English accents (~44k utterances)
  • Jenny: single high-quality voice (~21k utterances)
  • LibriTTS-R train-clean-{100,360} + train-other-500: 960 h of multi-speaker English audiobooks (354k utterances)
  • LJSpeech: single speaker, 24 h (13k utterances)

Single-pass training (no two-stage pretraining + fine-tune). Held-out validation: LibriTTS-R dev_clean + 5% of each non-LibriTTS source (deterministic shuffle, seed 2027).
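
The 5% holdouts could be reproduced roughly as below with the datasets library (a sketch under the stated seed; the dataset id is a placeholder, not the exact build used in training):

from datasets import load_dataset

# 5% validation slice of one non-LibriTTS source, deterministic shuffle with seed 2027.
ds = load_dataset("your-org/your-speech-corpus", split="train")   # placeholder dataset id
splits = ds.train_test_split(test_size=0.05, seed=2027, shuffle=True)
train_ds, val_ds = splits["train"], splits["test"]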

Text casing and punctuation are preserved. Pass text naturally; do not pre-lowercase.

Usage

pip install torch torchaudio transformers datasets

A reference audio clip is required. The model was trained multispeaker-only; without ref_codes it produces poor output. Any 3–12 s English speech clip works as the voice reference.

import torch
import numpy as np
from datasets import load_dataset
from transformers import AutoModel, AutoProcessor

model_id = "shangeth/Wren-TTS-360M-v1"
device   = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model     = AutoModel.from_pretrained(model_id, trust_remote_code=True).to(device).eval()

# Grab one LibriSpeech test-clean clip (~3.5s, not in training) as the reference voice.
# Swap this block for `torchaudio.load("your_reference.wav")` to use your own clip.
sample  = next(iter(load_dataset("openslr/librispeech_asr", "clean", split="test", streaming=True)))
ref_wav = torch.from_numpy(np.asarray(sample["audio"]["array"], dtype=np.float32)).unsqueeze(0)
ref_sr  = sample["audio"]["sampling_rate"]
ref_codes = model.encode_audio(ref_wav, ref_sr)[:, :150]   # cap at ~12s; encode_audio resamples to 24 kHz

# Tokenize the target text and generate speech in the reference voice
inputs = processor("Hello world, how are you today?")
inputs = {k: v.to(device) for k, v in inputs.items()}

waveform = model.generate(
    **inputs,
    ref_codes=ref_codes,
    max_audio_frames=200,
    min_audio_frames=2,
    temperature=0.8, top_k=50, top_p=0.9,
    output_audio=True,
)
processor.save_audio(waveform, "out.wav")
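
To use your own voice reference instead of the LibriSpeech sample, only the reference block changes; a sketch (encode_audio resamples internally, so the clip's sample rate does not matter):

import torchaudio

wav, sr = torchaudio.load("your_reference.wav")     # [channels, time]
wav = wav.mean(dim=0, keepdim=True)                 # downmix to mono -> [1, time]
ref_codes = model.encode_audio(wav, sr)[:, :150]    # same ~12 s cap as above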

Sampling tips

Defaults: temperature=0.8, top_k=50, top_p=0.9, max_audio_frames=200 (~16 s). If you hear the model generating extra speech past the intended text (hallucination):

  • Raise eos_bias (e.g. 2.0–6.0) to make the model more eager to stop
  • Lower temperature (0.6) and top_p (0.8)
  • Set max_audio_frames ≈ 12 * len(text_in_chars)
  • Set min_audio_frames=1 for very short prompts
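
Putting these together, a more conservative call for a short prompt might look like the sketch below (same generate arguments as in the Usage section, with eos_bias added):

text   = "Hello world."
inputs = processor(text)
inputs = {k: v.to(device) for k, v in inputs.items()}

waveform = model.generate(
    **inputs,
    ref_codes=ref_codes,
    eos_bias=3.0,                          # bias toward the AUDIO_EOS stop token
    temperature=0.6, top_k=50, top_p=0.8,  # lower-entropy sampling
    max_audio_frames=12 * len(text),       # rule-of-thumb frame cap from above
    min_audio_frames=1,                    # allow very short outputs
    output_audio=True,
)
processor.save_audio(waveform, "short.wav")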

Why delay pattern

Mimi uses residual vector quantization (RVQ): cb0 is semantic, cb1..cb7 encode successive residuals. cb_q is only meaningful given cb0..cb_{q-1}, so same-frame conditioning matters.

A flat interleaved layout (cb0_f0, cb1_f0, ..., cb0_f1, ...) preserves that conditioning best but balloons sequence length by k× and forces k autoregressive LLM calls per frame. The delay pattern keeps RVQ conditioning (cb_q at frame f is predicted from a hidden state that has already attended over cb0..cb_{q-1} of the same frame) while cutting sequence length to T + k - 1 and LLM calls to one per step, enabling all 8 Mimi codebooks without blowing up context.
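
A minimal sketch of the layout itself, assuming PAD fills the edges (the checkpoint's exact pad handling may differ):

import torch

K, PAD = 8, 2048   # 8 codebooks; 2048 = AUDIO_PAD index from the architecture section

def apply_delay(codes: torch.Tensor) -> torch.Tensor:
    """[K, T] Mimi tokens -> delayed grid [K, T + K - 1]; codebook q is shifted right by q steps."""
    T = codes.shape[1]
    out = torch.full((K, T + K - 1), PAD, dtype=codes.dtype)
    for q in range(K):
        out[q, q:q + T] = codes[q]         # codebook q, frame f lands at step f + q
    return out

def undo_delay(delayed: torch.Tensor) -> torch.Tensor:
    """Invert apply_delay: [K, T + K - 1] -> [K, T]."""
    T = delayed.shape[1] - K + 1
    return torch.stack([delayed[q, q:q + T] for q in range(K)])

codes = torch.randint(0, 2048, (K, 100))   # 100 frames = 8 s at 12.5 fps
assert torch.equal(undo_delay(apply_delay(codes)), codes)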

Limitations & known issues

  • Hallucinated continuations: occasionally generates plausible speech past the input text. Mitigate with eos_bias at inference.
  • English only.
  • Audiobook-style prosody inherited from LibriTTS-R; not as expressive as modern conversational TTS.
  • Small backbone (360M); quality is below frontier TTS systems.
  • cb0 begins to overfit earlier than cb3–cb7; the released checkpoint is from the epoch with the best overall validation loss in the full training run.

The Wren series

Wren is a family of compact (<3B parameter) multimodal speech LLMs, small enough to run on a single consumer GPU and designed for open research on unified speech understanding and synthesis. Planned siblings:

  • Wren-TTS: text → speech (this release)
  • Wren-ASR: speech → text
  • Wren-LM: speech-language modelling / dialog
  • Wren-Omni: unified ASR + TTS + LM in one checkpoint

All Wren models share the same design principles: small backbone LLM + neural audio codec, open weights, simple PyTorch checkpoints, reproducible training recipes.

Repository contents

File                       Purpose
model.safetensors          Model weights
config.json                WrenConfig (with auto_map for trust_remote_code)
tokenizer.json + friends   SmolLM2 tokenizer with Wren's 3 special tokens added
processor_config.json      WrenProcessor auto_map
configuration_wren.py      WrenConfig(PretrainedConfig)
modeling_wren.py           WrenForTTS(PreTrainedModel); loads the Mimi codec lazily on first generate
processing_wren.py         WrenProcessor(ProcessorMixin); tokenize + save_audio
README.md                  This model card

Citation

@misc{wren2026,
  title  = {Wren: A Family of Small Open-Weight Models for Unified Speech-Text Modelling},
  author = {Shangeth Rajaa},
  year   = {2026},
  url    = {https://github.com/shangeth/wren}
}

@inproceedings{koizumi2023libritts,
  title     = {LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus},
  author    = {Koizumi, Yuma and Zen, Heiga and Karita, Shigeki and Ding, Yifan
               and Yatabe, Kohei and Morioka, Nobuyuki and Bacchiani, Michiel and
               Zhang, Yu and Han, Wei and Bapna, Ankur},
  booktitle = {Interspeech},
  year      = {2023}
}

License

Apache-2.0 for the checkpoint weights and code in this repo. Upstream components carry their own licenses; review before redistribution.
