
Fine-Tuned Qwen3-TTS Model Checkpoint

Model Information

  • Speaker Name: hausa_speaker
  • Base Model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
  • Number of Layers: 28
  • Hidden Size: 2048
  • Training Epoch: 2
  • Training Step: 6200
  • Best Validation Loss: 5.2648

Training Configuration

  • Learning Rate: 0.0005
  • Train Batch Size: 16
  • Validation Batch Size: 32
  • Gradient Accumulation Steps: 1
  • Weight Decay: 0.01
  • Warmup Steps: 100
  • Speaker Encoder Frozen: False
  • Layer Replacement: 0 layers replaced, 0 layers added
  • Original Layers Frozen: False
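The warmup setting above implies a learning-rate ramp over the first 100 steps. A minimal sketch of a linear warmup using the card's values (learning rate 0.0005, 100 warmup steps); the constant rate after warmup is an assumption, since the card does not name the scheduler:

```python
def lr_at_step(step, base_lr=0.0005, warmup_steps=100):
    """Linear warmup to base_lr, then constant.
    (Post-warmup shape is assumed; the card does not specify it.)"""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Effective batch size = train batch size x gradient accumulation steps.
effective_batch = 16 * 1  # = 16
```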

Files Included

  • config.json - Model configuration
  • generation_config.json - Generation parameters
  • model.safetensors - Model weights (includes speaker encoder weights)
  • tokenizer_config.json - Tokenizer configuration
  • vocab.json - Vocabulary file
  • merges.txt - BPE merges file
  • preprocessor_config.json - Preprocessor configuration
  • speech_tokenizer/ - Speech tokenizer model and config
  • speaker_encoder/speaker_config.json - Speaker encoder configuration
  • training_state.json - Training state and configuration
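Before loading, it can be useful to confirm a downloaded checkpoint directory is complete. A small stand-alone helper (the file list mirrors the table above; `speech_tokenizer/` is a directory and is checked separately if needed):

```python
from pathlib import Path

# Files listed in the model card's "Files Included" section.
REQUIRED_FILES = [
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "vocab.json",
    "merges.txt",
    "preprocessor_config.json",
    "speaker_encoder/speaker_config.json",
    "training_state.json",
]

def missing_files(checkpoint_dir):
    """Return the required files absent from a checkpoint directory."""
    root = Path(checkpoint_dir)
    return [f for f in REQUIRED_FILES if not (root / f).is_file()]
```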

Loading the Model

```python
import torch

from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel

# Load the fine-tuned model
model = Qwen3TTSModel.from_pretrained(
    "./output/best",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)

# Generate speech with the new speaker
text = "Your text here"
ref_audio = "path/to/reference_audio.wav"

wavs, sr = model.generate_voice_clone(
    text=text,
    language="Auto",
    ref_audio=ref_audio,
    ref_text="Reference text for ICL mode",
    x_vector_only_mode=False
)
```
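The generated waveforms can then be written to disk. A minimal sketch using only the standard library, assuming `generate_voice_clone` returns mono float samples in [-1, 1] (the card does not document the exact return format):

```python
import struct
import wave

def save_wav(path, samples, sample_rate):
    """Write mono float samples in [-1, 1] as 16-bit PCM WAV.
    (Assumes float waveforms; adjust if the model returns int16.)"""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)   # 16-bit samples
        f.setframerate(sample_rate)
        pcm = b"".join(
            struct.pack("<h", max(-32768, min(32767, int(s * 32767))))
            for s in samples
        )
        f.writeframes(pcm)
```

For example, `save_wav("output.wav", wavs[0], sr)` would save the first generated clip.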

Speaker Information

The model has been fine-tuned for the speaker hausa_speaker. The speaker embedding is stored at index 3000 in the codec embedding layer. The speaker encoder weights are included in the checkpoint and have been fine-tuned.

Notes

  • This model uses the Qwen3-TTS tokenizer
  • The model supports streaming generation
  • For best results, use reference audio from the same speaker used during training
  • The speaker encoder has been fine-tuned to better capture speaker characteristics