# Semantic-DACVAE-Japanese
Semantic-DACVAE-Japanese is a fine-tuned version of facebook/dacvae-watermarked, specifically optimized for Japanese speech.
By incorporating WavLM-based semantic distillation, inspired by the Semantic-VAE paper, and continuing training on Japanese datasets, this model produces more natural reconstructions and higher subjective quality for Japanese audio than the original base model.
According to the Semantic-VAE paper, this semantic distillation approach should also improve the training efficiency and performance of downstream TTS models.
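The exact distillation objective used for this fine-tune is not spelled out here, but semantic distillation generally means aligning the VAE's frame-level latents with features from a pretrained speech encoder (WavLM in this case). Below is a minimal NumPy sketch of one plausible form of such a loss, assuming a learnable linear projection from latent space to the teacher's feature space and a cosine-distance objective; the actual training objective may differ:

```python
import numpy as np

def semantic_distillation_loss(latents, teacher_feats, proj):
    """Cosine-distance loss between projected VAE latents and teacher features.

    latents:       (T, D_latent)  frame-level VAE latents
    teacher_feats: (T, D_teacher) frame-aligned WavLM features
    proj:          (D_latent, D_teacher) learnable projection (hypothetical)
    """
    # Project latents into the teacher's feature space
    student = latents @ proj
    # Normalize both sides to unit length per frame
    s = student / (np.linalg.norm(student, axis=-1, keepdims=True) + 1e-8)
    t = teacher_feats / (np.linalg.norm(teacher_feats, axis=-1, keepdims=True) + 1e-8)
    # 1 - cosine similarity, averaged over frames
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))
```

Minimizing a loss of this shape pushes the latent space toward the phonetic structure captured by WavLM, which is the intuition behind the naturalness gains reported below.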
## Overview
- Base Model: facebook/dacvae-watermarked
- Enhancements: WavLM semantic distillation, inspired by the Semantic-VAE paper.
- Training Data: Fine-tuned on Japanese speech datasets.
- License: MIT
## Evaluation
We evaluated the model using the UTMOSv2 metric to measure the quality of the reconstructed audio. Our model demonstrates a clear improvement in naturalness over the original base model across both tested datasets.
### 1. Emilia-YODAS (Japanese Subset)
Tested on 100 samples (not included in training) from the Japanese subset of amphion/Emilia-Dataset.
| Audio Source | Mean UTMOSv2 (n=100) |
|---|---|
| Original Audio | 2.2099 |
| facebook/dacvae-watermarked | 2.2841 |
| Aratako/Semantic-DACVAE-Japanese | 2.4812 |
### 2. Private Test Dataset (Japanese)
Tested on 100 private Japanese speech samples.
| Audio Source | Mean UTMOSv2 (n=100) |
|---|---|
| Original Audio | 2.0322 |
| facebook/dacvae-watermarked | 1.8775 |
| Aratako/Semantic-DACVAE-Japanese | 2.1629 |
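For reference, the absolute UTMOSv2 gains over the base model implied by the two tables above can be computed directly from the reported means:

```python
# Mean UTMOSv2 scores copied from the tables above
scores = {
    "emilia_yodas": {"base": 2.2841, "ours": 2.4812},
    "private":      {"base": 1.8775, "ours": 2.1629},
}

for name, s in scores.items():
    gain = s["ours"] - s["base"]
    print(f"{name}: +{gain:.4f}")
```

Note that on the private set the base model actually scores below the original audio, while the fine-tuned model recovers above it.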
## Quick Start
### Installation
First, set up your environment and install the official repository:
```bash
# Create a virtual environment
uv venv --python=3.10

# Install the official dacvae package
uv pip install git+https://github.com/facebookresearch/dacvae
```
### Inference
Below is a basic example of inference.
```python
import soundfile as sf
import torch
import torchaudio
from audiotools import AudioSignal
from dacvae import DACVAE
from huggingface_hub import hf_hub_download

# 1. Load the model
model = DACVAE.load(
    hf_hub_download(repo_id="Aratako/Semantic-DACVAE-Japanese", filename="weights.pth")
).eval()

# Disable/bypass the default watermark since this model was fine-tuned without it
model.decoder.alpha = 0.0
model.decoder.watermark = lambda x, message=None, d=model.decoder: d.wm_model.encoder_block.forward_no_conv(x)

# 2. Load and preprocess audio
wav_np, sr = sf.read("input.wav", dtype="float32")
wav = torch.from_numpy(wav_np.T) if wav_np.ndim == 2 else torch.from_numpy(wav_np).unsqueeze(0)
# Downmix to mono and resample to the model's sample rate
wav = torchaudio.functional.resample(wav.mean(0, keepdim=True), sr, model.sample_rate)
signal = AudioSignal(wav.unsqueeze(0), model.sample_rate)
signal.normalize(-16.0)
signal.ensure_max_of_audio()
x = signal.audio_data.float()  # (1, 1, T)

# 3. Encode and decode
with torch.no_grad():
    z = model.encoder(model._pad(x))
    # Split the projection into mean and log-variance; keep the mean as the latent
    z, _ = model.quantizer.in_proj(z).chunk(2, dim=1)
    y = model.decode(z)[0].cpu()

# 4. Save the reconstructed audio
sf.write("recon.wav", y.squeeze(0).numpy(), model.sample_rate)
```
## Acknowledgements
- Base Model: Weights derived from facebook/dacvae-watermarked.
- Training Code: Leveraged the codebase from Descript Audio Codec (DAC).
- Methodology: Semantic distillation approach heavily inspired by the Semantic-VAE paper.
## Citation
```bibtex
@misc{semantic-dacvae-japanese,
  author       = {Chihiro Arata},
  title        = {Semantic-DACVAE-Japanese: Audio VAE for Japanese Speech},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/Semantic-DACVAE-Japanese}}
}
```