
Fine-Tuned Qwen3-TTS Model Checkpoint

Model Information

  • Speaker Name: hausa_speaker
  • Base Model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
  • Number of Layers: 28
  • Hidden Size: 2048
  • Training Epoch: 2
  • Training Step: 6200
  • Best Validation Loss: 5.2648

Training Configuration

  • Learning Rate: 0.0005
  • Train Batch Size: 16
  • Validation Batch Size: 32
  • Gradient Accumulation Steps: 1
  • Weight Decay: 0.01
  • Warmup Steps: 100
  • Speaker Encoder Frozen: False
  • Layer Replacement: 0 layers replaced, 0 layers added
  • Original Layers Frozen: False
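The warmup setting above implies a learning-rate ramp over the first 100 steps. A minimal sketch of a linear warmup using the card's values (learning rate 0.0005, 100 warmup steps); the constant rate after warmup is an assumption, since the card does not name the scheduler:

```python
def lr_at_step(step, base_lr=0.0005, warmup_steps=100):
    """Linear warmup to base_lr, then constant.
    (Post-warmup shape is assumed; the card does not specify it.)"""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Effective batch size = train batch size x gradient accumulation steps.
effective_batch = 16 * 1  # = 16
```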

Files Included

  • config.json - Model configuration
  • generation_config.json - Generation parameters
  • model.safetensors - Model weights (includes speaker encoder weights)
  • tokenizer_config.json - Tokenizer configuration
  • vocab.json - Vocabulary file
  • merges.txt - BPE merges file
  • preprocessor_config.json - Preprocessor configuration
  • speech_tokenizer/ - Speech tokenizer model and config
  • speaker_encoder/speaker_config.json - Speaker encoder configuration
  • training_state.json - Training state and configuration
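Before loading, it can be useful to confirm a downloaded checkpoint directory is complete. A small stand-alone helper (the file list mirrors the table above; `speech_tokenizer/` is a directory and is checked separately if needed):

```python
from pathlib import Path

# Files listed in the model card's "Files Included" section.
REQUIRED_FILES = [
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "vocab.json",
    "merges.txt",
    "preprocessor_config.json",
    "speaker_encoder/speaker_config.json",
    "training_state.json",
]

def missing_files(checkpoint_dir):
    """Return the required files absent from a checkpoint directory."""
    root = Path(checkpoint_dir)
    return [f for f in REQUIRED_FILES if not (root / f).is_file()]
```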

Loading the Model

```python
import torch

from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel

# Load the fine-tuned model
model = Qwen3TTSModel.from_pretrained(
    "./output/best",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)

# Generate speech with the new speaker
text = "Your text here"
ref_audio = "path/to/reference_audio.wav"

wavs, sr = model.generate_voice_clone(
    text=text,
    language="Auto",
    ref_audio=ref_audio,
    ref_text="Reference text for ICL mode",
    x_vector_only_mode=False
)
```
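The generated waveforms can then be written to disk. A minimal sketch using only the standard library, assuming `generate_voice_clone` returns mono float samples in [-1, 1] (the card does not document the exact return format):

```python
import struct
import wave

def save_wav(path, samples, sample_rate):
    """Write mono float samples in [-1, 1] as 16-bit PCM WAV.
    (Assumes float waveforms; adjust if the model returns int16.)"""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)   # 16-bit samples
        f.setframerate(sample_rate)
        pcm = b"".join(
            struct.pack("<h", max(-32768, min(32767, int(s * 32767))))
            for s in samples
        )
        f.writeframes(pcm)
```

For example, `save_wav("output.wav", wavs[0], sr)` would save the first generated clip.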

Speaker Information

The model has been fine-tuned for the speaker hausa_speaker. The speaker embedding is stored at index 3000 in the codec embedding layer. The speaker encoder weights are included in the checkpoint and have been fine-tuned.

Notes

  • This model uses the Qwen3-TTS tokenizer
  • The model supports streaming generation
  • For best results, use reference audio from the same speaker used during training
  • The speaker encoder has been fine-tuned to better capture speaker characteristics