Fine-Tuned Qwen3-TTS Model Checkpoint
Model Information
- Speaker Name: hausa_speaker
- Base Model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
- Number of Layers: 28
- Hidden Size: 2048
- Training Epoch: 2
- Training Step: 6200
- Best Validation Loss: 5.2648
Training Configuration
- Learning Rate: 0.0005
- Train Batch Size: 16
- Validation Batch Size: 32
- Gradient Accumulation Steps: 1
- Weight Decay: 0.01
- Warmup Steps: 100
- Speaker Encoder Frozen: False
- Layer Replacement: 0 layers replaced, 0 layers added
- Original Layers Frozen: False
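The figures above can be combined into a quick sanity check. The following is a minimal sketch, assuming that "Training Step" counts optimizer steps, that training ran on a single device, and that warmup is linear. None of these three points is stated in this card, and the schedule after warmup is not documented, so the helper simply holds the peak rate.

```python
# Values copied from the Training Configuration and Model Information lists.
LEARNING_RATE = 0.0005
TRAIN_BATCH_SIZE = 16
GRAD_ACCUM_STEPS = 1
WARMUP_STEPS = 100
TRAINING_STEP = 6200

# Assumption: one training process; the device count is not stated in the card.
NUM_DEVICES = 1


def effective_batch_size(per_device, accum_steps, devices=1):
    """Number of samples consumed per optimizer step."""
    return per_device * accum_steps * devices


def warmup_lr(step, peak_lr=LEARNING_RATE, warmup_steps=WARMUP_STEPS):
    """Linear warmup to peak_lr (assumed shape).

    The rate is held at peak_lr afterwards because the card does not
    document any decay schedule.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


batch = effective_batch_size(TRAIN_BATCH_SIZE, GRAD_ACCUM_STEPS, NUM_DEVICES)
samples_seen = batch * TRAINING_STEP
print(batch)         # 16
print(samples_seen)  # 99200
```

Under these assumptions the best checkpoint had seen roughly 99,200 training samples; multiply by the real device count if training was distributed.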
Files Included
- config.json: Model configuration
- generation_config.json: Generation parameters
- model.safetensors: Model weights (includes speaker encoder weights)
- tokenizer_config.json: Tokenizer configuration
- vocab.json: Vocabulary file
- merges.txt: BPE merges file
- preprocessor_config.json: Preprocessor configuration
- speech_tokenizer/: Speech tokenizer model and config
- speaker_encoder/speaker_config.json: Speaker encoder configuration
- training_state.json: Training state and configuration
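Before loading, it can help to confirm that a downloaded or copied checkpoint directory is complete. The sketch below checks for the entries listed above; the helper name `missing_entries` is hypothetical and is not part of the qwen_tts package.

```python
from pathlib import Path

# Entries taken from the "Files Included" list; a trailing "/" marks a directory.
EXPECTED_ENTRIES = [
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "vocab.json",
    "merges.txt",
    "preprocessor_config.json",
    "speech_tokenizer/",
    "speaker_encoder/speaker_config.json",
    "training_state.json",
]


def missing_entries(checkpoint_dir):
    """Return the expected entries that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    missing = []
    for entry in EXPECTED_ENTRIES:
        path = root / entry
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing


if __name__ == "__main__":
    # "./output/best" is the path used in the loading example of this card.
    print(missing_entries("./output/best"))
```

An empty list means every listed file and directory was found; it does not validate the contents of those files.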
Loading the Model
import torch
from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel

# Load the fine-tuned model
model = Qwen3TTSModel.from_pretrained(
    "./output/best",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)

# Generate speech with the new speaker
text = "Your text here"
ref_audio = "path/to/reference_audio.wav"
wavs, sr = model.generate_voice_clone(
    text=text,
    language="Auto",
    ref_audio=ref_audio,
    ref_text="Reference text for ICL mode",
    x_vector_only_mode=False,
)
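The loading example stops once `wavs` and `sr` are returned. The following is a minimal sketch of writing one result to disk using only the Python standard library. It assumes that `wavs[0]` is a one-dimensional sequence of float samples in the range [-1.0, 1.0] and that `sr` is the sample rate in Hz; this card does not document the return types, so verify them first. `save_wav` is a hypothetical helper, not a qwen_tts function.

```python
import array
import sys
import wave


def save_wav(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = array.array(
        "h",
        (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples),
    )
    # WAV data is little-endian; swap bytes on big-endian machines.
    if sys.byteorder == "big":
        pcm.byteswap()
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(int(sample_rate))
        wav_file.writeframes(pcm.tobytes())


# With the objects from the loading example (assumed shapes, see above):
# save_wav("output.wav", wavs[0], sr)
```

If the soundfile package is installed, `soundfile.write("output.wav", wavs[0], sr)` is a shorter alternative under the same assumption about the array layout.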
Speaker Information
The model has been fine-tuned for the speaker hausa_speaker. The speaker embedding is stored at index 3000 in the codec embedding layer. Speaker encoder weights are included in the checkpoint and have been fine-tuned.
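To locate the embedding table that holds row 3000, the header of model.safetensors can be read without loading any weights. A safetensors file starts with an 8-byte little-endian length followed by a JSON header that maps tensor names to dtype, shape, and byte offsets. The sketch below lists every 2-D tensor with more than 3000 rows. The tensor name of the codec embedding layer is not given in this card, so the output is only a list of candidates (other large tables, such as the text embedding, will also match), and both helper names are hypothetical.

```python
import json
import struct

SPEAKER_INDEX = 3000  # index stated in the Speaker Information section


def read_safetensors_header(path):
    """Parse the JSON header of a .safetensors file without reading weights."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def tensors_with_row(header, row_index):
    """Names and shapes of 2-D tensors that are large enough to contain row_index."""
    return {
        name: info["shape"]
        for name, info in header.items()
        if name != "__metadata__"  # optional free-form metadata entry
        and len(info["shape"]) == 2
        and info["shape"][0] > row_index
    }


# Example (path taken from the loading example in this card):
# header = read_safetensors_header("./output/best/model.safetensors")
# for name, shape in tensors_with_row(header, SPEAKER_INDEX).items():
#     print(name, shape)
```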
Notes
- This model uses the Qwen3-TTS tokenizer
- The model supports streaming generation
- For best results, use reference audio from the same speaker used during training
- The speaker encoder has been fine-tuned to better capture speaker characteristics