👉🏻 CosyVoice 👈🏻

Fun-CosyVoice 3.0: Demos; Paper; Modelscope; HuggingFace; CV3-Eval

CosyVoice 2.0: Demos; Paper; Modelscope; HuggingFace

CosyVoice 1.0: Demos; Paper; Modelscope; HuggingFace

Highlight🔥

Fun-CosyVoice 3.0 is an advanced text-to-speech (TTS) system based on large language models (LLMs). It surpasses its predecessor, CosyVoice 2.0, in content consistency, speaker similarity, and prosody naturalness, and is designed for zero-shot multilingual speech synthesis in the wild.

Key Features

  • Language Coverage: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) and 18+ Chinese dialects/accents (Guangdong, Minnan, Sichuan, Dongbei, Shaanxi (Shan3xi), Shanxi (Shan1xi), Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.), and supports both multilingual and cross-lingual zero-shot voice cloning.
  • Content Consistency & Naturalness: Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.
  • Pronunciation Inpainting: Supports pronunciation inpainting with Chinese Pinyin and English CMU phonemes, which gives finer control over pronunciation and makes the system better suited to production use.
  • Text Normalization: Supports reading of numbers, special symbols and various text formats without a traditional frontend module.
  • Bi-Streaming: Supports both text-in streaming and audio-out streaming, achieving latency as low as 150 ms while maintaining high-quality audio output.
  • Instruct Support: Supports various instructions such as languages, dialects, emotions, speed, volume, etc.
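
To check the streaming latency on your own hardware, one option is to time the arrival of the first audio chunk and compute the overall real-time factor (RTF). The sketch below is not part of CosyVoice: `measure_streaming`, `fake_stream` and the `num_samples` hook are hypothetical names, and the stand-in generator only imitates a synthesizer that yields audio chunks.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_streaming(
    chunks: Iterable,
    sample_rate: int,
    num_samples: Callable[[object], int] = len,
) -> Tuple[float, float]:
    """Return (first_chunk_latency_seconds, real_time_factor) for a chunk stream.

    `chunks` is any iterable that yields audio chunks as they are produced.
    `num_samples` maps one chunk to its sample count (hypothetical hook).
    """
    start = time.perf_counter()
    first_latency = None
    total_samples = 0
    for chunk in chunks:
        if first_latency is None:
            # Time until the first playable audio arrives.
            first_latency = time.perf_counter() - start
        total_samples += num_samples(chunk)
    elapsed = time.perf_counter() - start
    if first_latency is None or total_samples == 0:
        raise ValueError("the stream produced no audio")
    audio_seconds = total_samples / sample_rate
    # RTF below 1 means synthesis runs faster than playback.
    return first_latency, elapsed / audio_seconds


def fake_stream(n_chunks: int = 3, chunk_len: int = 2400, delay: float = 0.01):
    """Stand-in generator that imitates a streaming synthesizer."""
    for _ in range(n_chunks):
        time.sleep(delay)
        yield [0.0] * chunk_len


if __name__ == "__main__":
    latency, rtf = measure_streaming(fake_stream(), sample_rate=24000)
    print(f"first chunk after {latency * 1000:.1f} ms, RTF {rtf:.3f}")
```

With a real model, the iterable would presumably be the generator returned by a streaming inference call, and `num_samples` could read the sample count from each yielded item, for example `lambda out: out['tts_speech'].shape[-1]`.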

Roadmap

  • 2025/12

    • release Fun-CosyVoice3-0.5B-2512 base model, RL model, and their training/inference scripts
    • release Fun-CosyVoice3-0.5B modelscope gradio space
  • 2025/08

    • add Triton TensorRT-LLM runtime support and CosyVoice2 GRPO training support, thanks to a contribution from Yuekai Zhang (NVIDIA)
  • 2025/07

    • release Fun-CosyVoice 3.0 eval set
  • 2025/05

    • add CosyVoice2-0.5B vllm support
  • 2024/12

    • 25hz CosyVoice2-0.5B released
  • 2024/09

    • 25hz CosyVoice-300M base model
    • 25hz CosyVoice-300M voice conversion function
  • 2024/08

    • Repetition Aware Sampling (RAS) inference for LLM stability
    • Streaming inference mode support, including KV cache and SDPA for RTF optimization
  • 2024/07

    • Flow matching training support
    • WeTextProcessing support when ttsfrd is not available
    • Fastapi server and client

Evaluation

| Model | Open-Source | Model Size | test-zh CER (%) ↓ | test-zh Speaker Similarity (%) ↑ | test-en WER (%) ↓ | test-en Speaker Similarity (%) ↑ | test-hard CER (%) ↓ | test-hard Speaker Similarity (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
| Seed-TTS | | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
| MiniMax-Speech | | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
| F5-TTS | | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
| Spark TTS | | 0.5B | 1.2 | 66.0 | 1.98 | 57.3 | - | - |
| CosyVoice2 | | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
| FireRedTTS2 | | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 | - | - |
| Index-TTS2 | | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 | 7.12 | 75.5 |
| VibeVoice-1.5B | | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 | - | - |
| VibeVoice-Realtime | | 0.5B | - | - | 2.05 | 63.3 | - | - |
| HiggsAudio-v2 | | 3B | 1.50 | 74.0 | 2.44 | 67.7 | - | - |
| VoxCPM | | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
| GLM-TTS | | 1.5B | 1.03 | 76.1 | - | - | - | - |
| GLM-TTS RL | | 1.5B | 0.89 | 76.4 | - | - | - | - |
| Fun-CosyVoice3-0.5B-2512 | | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
| Fun-CosyVoice3-0.5B-2512_RL | | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
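
For readers unfamiliar with the CER and WER columns: both are edit-distance error rates, usually obtained by transcribing the synthesized audio with an ASR system and comparing the transcript against the input text. The sketch below shows only the scoring step, assuming the transcript is already available. The function names are illustrative, and the ASR step and any text normalization used by the actual benchmark are outside this sketch.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance as a percentage of the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def cer(ref: str, hyp: str) -> float:
    """Character error rate, the usual choice for Chinese text."""
    return error_rate(list(ref), list(hyp))


def wer(ref: str, hyp: str) -> float:
    """Word error rate over whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


if __name__ == "__main__":
    print(cer("你好世界", "你好世间"))
    print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Lower is better for both metrics, which is what the ↓ in the column headers indicates.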

Install

Clone and install

  • Clone the repo

    git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
    # If cloning the submodules fails due to network errors, rerun the following commands until they succeed
    cd CosyVoice
    git submodule update --init --recursive
    
  • Install Conda: please see https://docs.conda.io/en/latest/miniconda.html

  • Create Conda env:

    conda create -n cosyvoice -y python=3.10
    conda activate cosyvoice
    pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
    
    # If you encounter sox compatibility issues
    # ubuntu
    sudo apt-get install sox libsox-dev
    # centos
    sudo yum install sox sox-devel
    

Model download

from huggingface_hub import snapshot_download
snapshot_download('FunAudioLLM/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
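
If the download script is run repeatedly, it may be convenient to skip repositories that are already on disk. The helper below is a minimal sketch: `ensure_model` is a hypothetical name, and treating a non-empty directory as a finished download is an assumption that does not verify the files.

```python
from pathlib import Path


def ensure_model(repo_id: str, local_dir: str) -> str:
    """Download `repo_id` into `local_dir` unless the directory is already populated."""
    target = Path(local_dir)
    if target.is_dir() and any(target.iterdir()):
        # Assumption: a non-empty directory means an earlier download finished.
        return str(target)
    # Imported lazily so the already-downloaded path needs neither the
    # package nor network access.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id, local_dir=str(target))


if __name__ == "__main__":
    print(ensure_model("FunAudioLLM/CosyVoice2-0.5B", "pretrained_models/CosyVoice2-0.5B"))
```

Delete the target directory to force a fresh download, for example after an interrupted transfer.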

Optionally, unzip the ttsfrd resource and install the ttsfrd package for better text normalization performance.

This step is not required. If the ttsfrd package is not installed, wetext is used by default.

cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d .
pip install ttsfrd_dependency-0.1-py3-none-any.whl
pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
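
To see which text normalization package the current environment can import, a quick check such as the following may help. This is only a sketch: `first_available` is a hypothetical helper, it does not reproduce how CosyVoice selects its frontend internally, and it assumes the import names match the package names `ttsfrd` and `wetext`.

```python
import importlib.util
from typing import Iterable, Optional


def first_available(modules: Iterable[str]) -> Optional[str]:
    """Return the first top-level module name that can be found, else None."""
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return None


if __name__ == "__main__":
    # Preference order mirrors the text above: ttsfrd first, wetext as the fallback.
    frontend = first_available(["ttsfrd", "wetext"])
    print(f"text normalization package found: {frontend}")
```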

Basic Usage

import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import AutoModel
import torchaudio

# CosyVoice2 usage; see https://funaudiollm.github.io/cosyvoice2/ for more details
cosyvoice = AutoModel(model_dir='pretrained_models/CosyVoice2-0.5B')

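# NOTE: to reproduce the results on https://funaudiollm.github.io/cosyvoice2, add text_frontend=False during inference
# en zero_shot usage
# the second argument is the transcript of the prompt audio (roughly: "I hope you will do even better than me in the future.")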
for i, j in enumerate(cosyvoice.inference_zero_shot('CosyVoice is undergoing a comprehensive upgrade, providing more accurate, stable, faster, and better voice generation capabilities.', '希望你以后能够做的比我还好呦。', './asset/zero_shot_prompt.wav')):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# zh zero_shot usage
for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', './asset/zero_shot_prompt.wav')):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

# save zero_shot spk for future usage
assert cosyvoice.add_zero_shot_spk('希望你以后能够做的比我还好呦。', './asset/zero_shot_prompt.wav', 'my_zero_shot_spk') is True
for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '', '', zero_shot_spk_id='my_zero_shot_spk')):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
cosyvoice.save_spkinfo()

# fine-grained control; for the supported control tokens, see cosyvoice/tokenizer/tokenizer.py#L248
for i, j in enumerate(cosyvoice.inference_cross_lingual('在他讲述那个荒诞故事的过程中,他突然[laughter]停下来,因为他自己也被逗笑了[laughter]。', './asset/zero_shot_prompt.wav')):
    torchaudio.save('fine_grained_control_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

# instruct usage
for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '用四川话说这句话<|endofprompt|>', './asset/zero_shot_prompt.wav')):
    torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

# bistream usage: the text input can be a generator, which is useful when the text comes from a text LLM
# NOTE: keep some basic sentence-splitting logic, because the LLM cannot handle arbitrary sentence lengths
def text_generator():
    yield '收到好友从远方寄来的生日礼物,'
    yield '那份意外的惊喜与深深的祝福'
    yield '让我心中充满了甜蜜的快乐,'
    yield '笑容如花儿般绽放。'
for i, j in enumerate(cosyvoice.inference_zero_shot(text_generator(), '希望你以后能够做的比我还好呦。', './asset/zero_shot_prompt.wav', stream=False)):
    torchaudio.save('zero_shot_bistream_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
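
The sentence-splitting logic mentioned in the NOTE above could look roughly like this. It is a deliberately naive sketch: `split_sentences`, `SENTENCE_END` and the `max_chars` cap are hypothetical, and splitting on every `.` will also break decimals such as "3.14", so real input may need a more careful rule.

```python
from typing import Iterable, Iterator

# Characters treated as sentence boundaries (assumption: adjust per language).
SENTENCE_END = set("。！？；.!?;\n")


def split_sentences(pieces: Iterable[str], max_chars: int = 80) -> Iterator[str]:
    """Regroup arbitrary text pieces (e.g. LLM output deltas) into sentence-sized chunks."""
    buffer = ""
    for piece in pieces:
        for ch in piece:
            buffer += ch
            # Flush on punctuation, or when the buffer grows past the cap.
            if ch in SENTENCE_END or len(buffer) >= max_chars:
                if buffer.strip():
                    yield buffer.strip()
                buffer = ""
    # Emit whatever is left once the upstream text ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    deltas = ["收到好友从远方", "寄来的生日礼物。那份", "意外的惊喜！"]
    for sentence in split_sentences(deltas):
        print(sentence)
```

The result is itself a generator, so, assuming it is consumed the same way as `text_generator()` above, it could be passed in its place.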

Discussion & Communication

You can discuss directly on GitHub Issues.

You can also scan the QR code to join our official Dingding chat group.

Acknowledgements

  1. We borrowed a lot of code from FunASR.
  2. We borrowed a lot of code from FunCodec.
  3. We borrowed a lot of code from Matcha-TTS.
  4. We borrowed a lot of code from AcademiCodec.
  5. We borrowed a lot of code from WeNet.

Citations

@article{du2024cosyvoice,
  title={Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens},
  author={Du, Zhihao and Chen, Qian and Zhang, Shiliang and Hu, Kai and Lu, Heng and Yang, Yexin and Hu, Hangrui and Zheng, Siqi and Gu, Yue and Ma, Ziyang and others},
  journal={arXiv preprint arXiv:2407.05407},
  year={2024}
}

@article{du2024cosyvoice2,
  title={Cosyvoice 2: Scalable streaming speech synthesis with large language models},
  author={Du, Zhihao and Wang, Yuxuan and Chen, Qian and Shi, Xian and Lv, Xiang and Zhao, Tianyu and Gao, Zhifu and Yang, Yexin and Gao, Changfeng and Wang, Hui and others},
  journal={arXiv preprint arXiv:2412.10117},
  year={2024}
}

@article{du2025cosyvoice,
  title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
  author={Du, Zhihao and Gao, Changfeng and Wang, Yuxuan and Yu, Fan and Zhao, Tianyu and Wang, Hao and Lv, Xiang and Wang, Hui and Shi, Xian and An, Keyu and others},
  journal={arXiv preprint arXiv:2505.17589},
  year={2025}
}

@inproceedings{lyu2025build,
  title={Build LLM-Based Zero-Shot Streaming TTS System with Cosyvoice},
  author={Lyu, Xiang and Wang, Yuxuan and Zhao, Tianyu and Wang, Hao and Liu, Huadai and Du, Zhihao},
  booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--2},
  year={2025},
  organization={IEEE}
}

Disclaimer

The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.
