
	---
	tags:
	- text-to-speech
	license: cc-by-nc-sa-4.0
	language:
	- zh
	- en
	- de
	- ja
	- fr
	- es
	- ko
	- ar
	- nl
	- ru
	- it
	- pl
	- pt
	pipeline_tag: text-to-speech
	inference: false
	extra_gated_prompt: >-
	  You agree to not use the model to generate contents that violate DMCA or local
	  laws.
	extra_gated_fields:
	  Country: country
	  Specific date: date_picker
	  I agree to use this model for non-commercial use ONLY: checkbox
	---


	# FishAudio S1

	FishAudio S1 is a leading text-to-speech (TTS) model trained on more than 2 million hours of audio data in multiple languages.

	Supported languages:
	- English (en)
	- Chinese (zh)
	- Japanese (ja)
	- German (de)
	- French (fr)
	- Spanish (es)
	- Korean (ko)
	- Arabic (ar)
	- Russian (ru)
	- Dutch (nl)
	- Italian (it)
	- Polish (pl)
	- Portuguese (pt)

	Please refer to the [Fish Speech GitHub repository](https://github.com/fishaudio/fish-speech) for more information.
	A demo is available at the [Fish Audio Playground](https://fish.audio).
	Visit the [Fish Audio website](https://openaudio.com) for the blog and technical report.

	## Emotion and Tone Support

	FishAudio S1 supports a variety of emotion, tone, and special markers to enhance speech synthesis:

	1. Emotional markers:
	(angry) (sad) (disdainful) (excited) (surprised) (satisfied) (unhappy) (anxious) (hysterical) (delighted) (scared) (worried) (indifferent) (upset) (impatient) (nervous) (guilty) (scornful) (frustrated) (depressed) (panicked) (furious) (empathetic) (embarrassed) (reluctant) (disgusted) (keen) (moved) (proud) (relaxed) (grateful) (confident) (interested) (curious) (confused) (joyful) (disapproving) (negative) (denying) (astonished) (serious) (sarcastic) (conciliative) (comforting) (sincere) (sneering) (hesitating) (yielding) (painful) (awkward) (amused)

	2. Tone markers:
	(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)

	3. Special markers:
	(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting) (groaning) (crowd laughing) (background laughter) (audience laughing)

	Special markers with corresponding onomatopoeia:
	- Laughing: Ha,ha,ha
	- Chuckling: Hmm,hmm
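The markers above are plain parenthesized strings embedded in the input text. The following is a minimal sketch of building tagged input in Python. The helper `tag_text` is hypothetical and not part of Fish Speech, the emotion set is only a subset of the list above, and placing markers at the start of the text is an assumption, since this card does not state where a marker must appear.

```python
# Marker sets copied from the lists above. EMOTION_MARKERS is a subset only.
EMOTION_MARKERS = {"angry", "sad", "excited", "surprised", "worried", "joyful"}
TONE_MARKERS = {"in a hurry tone", "shouting", "screaming", "whispering", "soft tone"}
SPECIAL_MARKERS = {
    "laughing", "chuckling", "sobbing", "crying loudly", "sighing", "panting",
    "groaning", "crowd laughing", "background laughter", "audience laughing",
}
KNOWN_MARKERS = EMOTION_MARKERS | TONE_MARKERS | SPECIAL_MARKERS


def tag_text(text, *markers):
    """Prefix `text` with parenthesized markers (hypothetical helper).

    Unknown markers raise ValueError so typos are caught before synthesis.
    Putting the markers at the start of the text is an assumption.
    """
    for marker in markers:
        if marker not in KNOWN_MARKERS:
            raise ValueError(f"unknown marker: {marker!r}")
    prefix = " ".join(f"({marker})" for marker in markers)
    return f"{prefix} {text}" if prefix else text


if __name__ == "__main__":
    print(tag_text("I can't believe we won!", "excited"))
    # A special marker followed by its onomatopoeia, as listed above.
    print(tag_text("Ha,ha,ha", "laughing"))
```

Validating against a fixed set is a design choice for this sketch: a misspelled marker would otherwise be passed through as ordinary text.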

	## Model Variants and Performance

	FishAudio S1 includes the following models:
	- S1 (4B, proprietary): The full-sized model.
	- S1-mini (0.5B): A distilled version of S1.

	Both S1 and S1-mini incorporate online Reinforcement Learning from Human Feedback (RLHF).

	Seed TTS Eval metrics (English, automatic evaluation; transcription with OpenAI gpt-4o-transcribe, speaker distance computed with Revai/pyannote-wespeaker-voxceleb-resnet34-LM):

	- S1:
	  - WER (Word Error Rate): 0.008
	  - CER (Character Error Rate): 0.004
	  - Distance: 0.332
	- S1-mini:
	  - WER (Word Error Rate): 0.011
	  - CER (Character Error Rate): 0.005
	  - Distance: 0.380
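For readers unfamiliar with the error-rate metrics, WER and CER are conventionally defined as the Levenshtein edit distance between a reference transcript and the recognized transcript, divided by the reference length in words or characters, so lower is better. The sketch below shows that textbook definition only. It is not the Seed TTS Eval pipeline, and any text normalization applied there (casing, punctuation) is not described in this card and is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            ))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sat on a mat"
    print(f"WER = {wer(reference, hypothesis):.3f}")
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

Under this definition, a WER of 0.008 corresponds to roughly one word error per 125 reference words.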

	## License

	This model is released under the CC-BY-NC-SA-4.0 license, which permits non-commercial use only and requires derivative works to be shared under the same license.