VoxCPM2: ONNX (INT8)
2B-parameter multilingual TTS with voice cloning and voice design. 48 kHz output.
Part of the soniqo.audio speech toolkit, an open, runtime-portable stack for speech AI. This bundle is the ONNX Runtime export, designed to plug into the abstract interfaces in
speech-core (OnnxVoxCPM2Tts). Browse all ONNX bundles in the soniqo ONNX collection.
Use cases on soniqo.audio
ONNX export of openbmb/VoxCPM2, a 2B-parameter diffusion-autoregressive TTS with 48 kHz studio-quality output, reference-audio voice cloning, and natural-language voice design. It is a drop-in replacement for the LiteRT bundle on the synth worker side: same graph topology, same I/O contracts. It runs on the ONNX Runtime CPU EP today; the CUDA EP is already wired into the wrapper for a GPU swap.
Why ONNX
This export targets ONNX Runtime as a complement to the LiteRT bundle. Both use the same four-graph split; on a CPU-only workload ONNX Runtime gives us:
- ~28% lower peak RSS during inference (8.2 GiB vs 11.5 GiB after load, 9.3 GiB vs 13.0 GiB peak; measured on a Mac CPU, same prompt, same step count). On a memory-constrained synth pod, that can be the difference between fitting and not fitting.
- ~2.4× lower per-step latency (110 ms vs 266 ms per AR step on the same hardware). The XNNPACK INT8 path in ORT 1.26 is more aggressive about constant-folding the dequant.
- A clean path to GPU acceleration via the CUDA EP without re-exporting the bundle.
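The GPU path in the last bullet comes down to changing the execution-provider list passed to `InferenceSession`. A minimal sketch, assuming the caller wants CUDA when it is present and CPU otherwise; `pick_providers` is a hypothetical helper, not part of speech-core:

```python
def pick_providers(available, prefer_gpu=True):
    """Return an ordered provider list for ort.InferenceSession.

    `available` is the list reported by onnxruntime.get_available_providers().
    The CPU provider is always kept last as the fallback.
    """
    providers = []
    if prefer_gpu and "CUDAExecutionProvider" in available:
        providers.append("CUDAExecutionProvider")
    providers.append("CPUExecutionProvider")
    return providers


# Hard-coded lists for illustration; in real use pass
# onnxruntime.get_available_providers() instead.
print(pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
print(pick_providers(["CPUExecutionProvider"]))
```

The same `.onnx` files are used in both cases; only the `providers=` argument changes.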
Pipeline
VoxCPM2 is not a single feed-forward model. The runtime loop is:
    text + optional instruction ──► text-prefill
                                        │
                                        ▼
                   repeated token-step (KV cache rolls per step)
                                        │
                                        ▼
                                  audio-decoder ──► 48 kHz PCM
The host owns the loop and the KV cache; ONNX owns the static tensor programs. This is the same split as the LiteRT bundle in this collection: same host-side wrapper code, just a different runtime backend.
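The division of labour described above can be sketched in plain Python. The three callables stand in for the ONNX sessions; their signatures, the stop signal, and the `max_steps` cap are assumptions made for illustration, not the bundle's real I/O contract (see config.json and the speech-core wrapper for that):

```python
from typing import Any, Callable, List, Tuple


def synthesize(
    prefill: Callable[[List[int]], Any],
    token_step: Callable[[Any], Tuple[Any, Any, bool]],
    decode: Callable[[List[Any]], Any],
    token_ids: List[int],
    max_steps: int = 1000,  # hypothetical safety cap
):
    """Host-owned loop: prefill once, step until stop, decode once."""
    kv_cache = prefill(token_ids)  # text-prefill produces the KV cache
    frames = []
    for _ in range(max_steps):
        # The host passes the cache in and receives the updated cache back.
        frame, kv_cache, stop = token_step(kv_cache)
        frames.append(frame)
        if stop:
            break
    return decode(frames)  # audio-decoder turns the frames into PCM


# Stub graphs so the control flow can be exercised without the models.
def fake_prefill(ids):
    return {"pos": len(ids)}


def fake_step(cache):
    cache = {"pos": cache["pos"] + 1}
    return cache["pos"], cache, cache["pos"] >= 5


def fake_decode(frames):
    return frames


print(synthesize(fake_prefill, fake_step, fake_decode, [1, 2, 3]))
```

Keeping the loop on the host is what lets the same wrapper drive either runtime: only the three callables change.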
Files
| File | Size | Description |
|---|---|---|
| voxcpm2-text-prefill.onnx + .onnx.data | 2.4 GB | INT8 weight-only text + instruction prefill (MiniCPM-4 KV-cache producer). max_text_tokens = 512. |
| voxcpm2-token-step.onnx + .onnx.data | 2.8 GB | INT8 weight-only autoregressive step (MiniCPM-4 + residual LM, KV-cache in/out, CFM Euler decoder). |
| voxcpm2-audio-encoder.onnx | 183 MB | FP32 reference-audio encoder (16 kHz @ 6.4 s → 40 latent frames, voice-cloning only). |
| voxcpm2-audio-decoder.onnx | 175 MB | FP32 AudioVAE decoder (acoustic tokens → 48 kHz PCM, 10.24 s window). |
| tokenizer.json / tokenizer_config.json / special_tokens_map.json | - | HF tokenizer bundle. |
| generation_config.json / tokenization_voxcpm2.py | - | Generation defaults + tokenizer module. |
| config.json | - | Model config (architecture, dims, IO shapes per graph). |
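The fixed windows in the table translate into sample counts as follows. This is a worked calculation from the figures above only; zero-padding short reference clips is an assumption, and the real wrapper may handle them differently:

```python
ENCODER_RATE_HZ = 16_000
ENCODER_WINDOW_MS = 6_400      # 6.4 s reference-audio window
ENCODER_FRAMES = 40            # latent frames per window
DECODER_RATE_HZ = 48_000
DECODER_WINDOW_MS = 10_240     # 10.24 s decoder window

# Integer arithmetic avoids float rounding on the sample counts.
encoder_samples = ENCODER_RATE_HZ * ENCODER_WINDOW_MS // 1000
decoder_samples = DECODER_RATE_HZ * DECODER_WINDOW_MS // 1000
ms_per_latent_frame = ENCODER_WINDOW_MS / ENCODER_FRAMES


def fit_reference(samples):
    """Trim or zero-pad a sequence of 16 kHz samples to the encoder window."""
    clipped = list(samples[:encoder_samples])
    return clipped + [0.0] * (encoder_samples - len(clipped))


print(encoder_samples, decoder_samples, ms_per_latent_frame)
```

So the encoder expects 102,400 samples per reference clip, each latent frame covers 160 ms of reference audio, and one decoder window holds 491,520 output samples.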
Quantization recipe: onnxruntime.quantization.quantize_dynamic with
weight_type=QuantType.QInt8, op_types_to_quantize=["MatMul", "Gemm"]. Activations
stay FP32. AudioVAE stays FP32 (Conv-heavy; dynamic INT8 rejects Conv axis
remapping, the same lesson as Parakeet's decoder-joint).
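The recipe maps onto `quantize_dynamic` roughly as below. This is a sketch only: the file names are placeholders, and `use_external_data_format=True` is an assumption based on the graphs exceeding 2 GB; the exact options used to build this bundle are not stated here.

```python
QUANT_OPTIONS = {
    "op_types_to_quantize": ["MatMul", "Gemm"],  # from the recipe above
    "use_external_data_format": True,            # assumption: graphs exceed 2 GB
}


def quantize_graph(fp32_path, int8_path):
    """Apply dynamic INT8 quantization to the MatMul/Gemm weights of one graph."""
    # Imported lazily so the options can be inspected without ORT installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        fp32_path,
        int8_path,
        weight_type=QuantType.QInt8,
        **QUANT_OPTIONS,
    )


# Hypothetical file names, for illustration only:
# quantize_graph("token-step.fp32.onnx", "voxcpm2-token-step.onnx")
```

Restricting `op_types_to_quantize` to MatMul and Gemm leaves every other op, Conv included, untouched.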
The .onnx.data files are external-data sidecars (the production
weights exceed the 2 GB protobuf serialization cap). ORT's
InferenceSession auto-resolves them from the protobuf's external_data
references with no special SessionOptions, as long as each sidecar sits
next to its .onnx file.
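Because a sidecar is located relative to its `.onnx` file, a partial download only surfaces as an error at session creation. A small pre-flight check, assuming the sidecars are named exactly as in the file table (`missing_sidecars` is a hypothetical helper):

```python
import tempfile
from pathlib import Path

# Graphs that the file table lists with an external-data sidecar.
GRAPHS_WITH_SIDECAR = ["voxcpm2-text-prefill.onnx", "voxcpm2-token-step.onnx"]


def missing_sidecars(bundle_dir):
    """Return the sidecar file names that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [
        f"{name}.data"
        for name in GRAPHS_WITH_SIDECAR
        if not (root / f"{name}.data").is_file()
    ]


# Demo on a throwaway directory that contains only one of the two sidecars.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "voxcpm2-token-step.onnx.data").touch()
    print(missing_sidecars(tmp))
```

Running the check before constructing the sessions gives a clearer failure than a load error from inside the runtime.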
Quick start (Python)
import onnxruntime as ort
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

# InferenceSession needs local file paths, not a Hub repo id, so download
# the bundle (including the .onnx.data sidecars) to a local directory first.
bundle = snapshot_download("soniqo/VoxCPM2-ONNX")
tokenizer = AutoTokenizer.from_pretrained(bundle, trust_remote_code=True)

providers = ["CPUExecutionProvider"]
prefill = ort.InferenceSession(f"{bundle}/voxcpm2-text-prefill.onnx", providers=providers)
step = ort.InferenceSession(f"{bundle}/voxcpm2-token-step.onnx", providers=providers)
encoder = ort.InferenceSession(f"{bundle}/voxcpm2-audio-encoder.onnx", providers=providers)  # voice cloning only
decoder = ort.InferenceSession(f"{bundle}/voxcpm2-audio-decoder.onnx", providers=providers)
# ... see the speech-core OnnxVoxCPM2Tts wrapper for the full AR loop.
For a complete reference implementation see OnnxVoxCPM2Tts in speech-core.
License
Apache 2.0, inherited from upstream openbmb/VoxCPM2. Apache 2.0 covers both the weights and any exported derivative; verify against the upstream model card before commercial use.
Citation
@misc{openbmb-voxcpm2,
author = {OpenBMB},
title = {{VoxCPM2}: a 2B-parameter diffusion-autoregressive multilingual TTS},
year = {2025},
howpublished = {\url{https://huggingface.co/openbmb/VoxCPM2}}
}