VoxCPM2 — ONNX (INT8)

2 B-parameter multilingual TTS with voice cloning and voice design. 48 kHz output.

Part of the soniqo.audio speech toolkit — an open, runtime-portable stack for speech AI. This bundle is the ONNX Runtime export, designed to plug into the abstract interfaces in speech-core (OnnxVoxCPM2Tts). Browse all ONNX bundles in the soniqo ONNX collection.

Use cases on soniqo.audio

This is an ONNX export of openbmb/VoxCPM2, a 2 B-parameter diffusion-autoregressive TTS model with 48 kHz studio-quality output, reference-audio voice cloning, and natural-language voice design. It is a drop-in replacement for the LiteRT bundle on the synth worker side: same graph topology, same I/O contracts. It runs on the ONNX Runtime CPU EP today; the wrapper also wires in the CUDA EP so a GPU can be swapped in.

Why ONNX

This export targets ONNX Runtime as a complement to the LiteRT bundle. Both use the same four-graph split; on a CPU-only workload ONNX Runtime gives us:

~28 % lower resident memory during inference (8.2 GiB vs 11.5 GiB after load, 9.3 GiB vs 13.0 GiB at peak; measured on a Mac CPU, same prompt, same step count). On a memory-constrained synth pod, that can be the difference between fitting and not fitting.
~2.4× lower per-step latency (110 ms vs 266 ms per AR step on the same hardware): the XNNPACK INT8 path in ORT 1.26 is more aggressive about constant-folding the dequantization.
A clean path to GPU acceleration via the CUDA EP without re-exporting the bundle.
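
The GPU swap mentioned above comes down to the provider list passed to each session. A minimal sketch, assuming the caller wants CUDA when it is present and the CPU EP as a fallback; `pick_providers` is a hypothetical helper, not part of speech-core:

```python
def pick_providers(available, prefer_gpu=True):
    """Return an ordered ONNX Runtime provider list, CUDA first when present."""
    providers = []
    if prefer_gpu and "CUDAExecutionProvider" in available:
        providers.append("CUDAExecutionProvider")
    # Keep the CPU EP last so ops without a CUDA kernel still have a fallback.
    providers.append("CPUExecutionProvider")
    return providers
```

In practice the argument would come from `onnxruntime.get_available_providers()`, and the result would be passed as `providers=` to `InferenceSession`. The same `.onnx` files are used either way.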

Pipeline

VoxCPM2 is not a single feed-forward model. The runtime loop is:

text + optional instruction ──► text-prefill
                                      │
                                      ▼
                              repeated token-step  (KV cache rolls per step)
                                      │
                                      ▼
                              audio-decoder ──► 48 kHz PCM

The host owns the loop and the KV cache; ONNX owns the static tensor programs. Same split as the LiteRT bundle in this collection — same host-side wrapper code, just a different runtime backend.
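
The division of labour described above can be sketched as a plain host-side loop. This is an illustration only: the three callables, their signatures, and the stop flag are assumptions made for the sketch and do not reflect the real tensor names or the speech-core wrapper.

```python
def synthesize(prefill, step, decode, token_ids, max_steps=64):
    """Run prefill once, step until a stop signal or max_steps, then decode.

    prefill(token_ids) -> kv_cache                (hypothetical signature)
    step(kv_cache)     -> (latent, kv_cache, stop) (hypothetical signature)
    decode(latents)    -> pcm                     (hypothetical signature)
    """
    kv_cache = prefill(token_ids)      # the host holds the cache from here on
    latents = []
    for _ in range(max_steps):
        latent, kv_cache, stop = step(kv_cache)   # cache rolls forward each step
        latents.append(latent)
        if stop:
            break
    return decode(latents)
```

Each callable would wrap one `InferenceSession.run` call; the loop itself stays ordinary host code, which is why the same wrapper logic can sit on top of either runtime.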

Files

File	Size	Description
`voxcpm2-text-prefill.onnx` + `.onnx.data`	2.4 GB	INT8 weight-only text + instruction prefill (MiniCPM-4 KV-cache producer). `max_text_tokens = 512`.
`voxcpm2-token-step.onnx` + `.onnx.data`	2.8 GB	INT8 weight-only autoregressive step (MiniCPM-4 + residual LM, KV-cache in/out, CFM Euler decoder).
`voxcpm2-audio-encoder.onnx`	183 MB	FP32 reference-audio encoder (16 kHz @ 6.4 s → 40 latent frames, voice-cloning only).
`voxcpm2-audio-decoder.onnx`	175 MB	FP32 AudioVAE decoder (acoustic tokens → 48 kHz PCM, 10.24 s window).
`tokenizer.json` / `tokenizer_config.json` / `special_tokens_map.json`	—	HF tokenizer bundle.
`generation_config.json` / `tokenization_voxcpm2.py`	—	Generation defaults + tokenizer module.
`config.json`	—	Model config (architecture, dims, IO shapes per graph).
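
The window sizes in the table translate into fixed buffer lengths. A small worked calculation, assuming the windows are exact and counting samples per channel (the table does not state the channel count):

```python
def samples(rate_hz, duration_ms):
    """Number of samples in a window, using integer milliseconds to avoid float error."""
    return rate_hz * duration_ms // 1000

encoder_samples = samples(16_000, 6_400)     # 102400 input samples for the encoder
samples_per_latent = encoder_samples // 40   # 2560 input samples per latent frame
decoder_samples = samples(48_000, 10_240)    # 491520 output samples per decoder window
```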

Quantization recipe: onnxruntime.quantization.quantize_dynamic with weight_type=QuantType.QInt8, op_types_to_quantize=["MatMul", "Gemm"]. Activations stay FP32. AudioVAE stays FP32 (Conv-heavy; dynamic INT8 rejects Conv axis remapping, the same lesson as Parakeet's decoder-joint).
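
For readers who want to reproduce the recipe on their own export, the call might look like the sketch below. The function name and paths are hypothetical, and `use_external_data_format=True` is an assumption added here because the quantized graphs exceed 2 GB; the source states only the weight type and op list.

```python
QUANTIZED_OPS = ["MatMul", "Gemm"]

def quantize_weight_only(src_path, dst_path):
    """Weight-only dynamic INT8 quantization of MatMul/Gemm nodes."""
    # Imported lazily so this module loads even without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=src_path,
        model_output=dst_path,
        weight_type=QuantType.QInt8,
        op_types_to_quantize=QUANTIZED_OPS,
        use_external_data_format=True,  # assumption: write weights to a sidecar
    )
```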

The .onnx.data files are external-data sidecars (the production weights exceed the 2 GB protobuf serialization cap). ORT's InferenceSession auto-resolves them from the protobuf's external_data references with no special SessionOptions, provided each sidecar sits in the same directory as its .onnx file.
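
A partial download that drops a sidecar only surfaces as an error when the session is created, so a quick check beforehand can help. A minimal sketch, assuming each sidecar is named `<graph>.onnx.data` as the Files table suggests; `missing_sidecars` is a hypothetical helper:

```python
from pathlib import Path

# Graphs listed in the Files table as shipping with an external-data sidecar.
GRAPHS_WITH_SIDECAR = ["voxcpm2-text-prefill.onnx", "voxcpm2-token-step.onnx"]

def missing_sidecars(bundle_dir):
    """Return the graphs whose .onnx.data file is absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in GRAPHS_WITH_SIDECAR
            if not (root / f"{name}.data").is_file()]
```

An empty list means both sidecars are in place next to their graphs.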

Quick start (Python)

import onnxruntime as ort
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

# InferenceSession needs local file paths, not a Hub repo id, so fetch the
# bundle first. snapshot_download returns the local cache directory.
bundle = snapshot_download("soniqo/VoxCPM2-ONNX")
tokenizer = AutoTokenizer.from_pretrained(bundle, trust_remote_code=True)

providers = ["CPUExecutionProvider"]
prefill = ort.InferenceSession(f"{bundle}/voxcpm2-text-prefill.onnx", providers=providers)
step    = ort.InferenceSession(f"{bundle}/voxcpm2-token-step.onnx", providers=providers)
# The encoder is only needed for voice cloning from reference audio.
encoder = ort.InferenceSession(f"{bundle}/voxcpm2-audio-encoder.onnx", providers=providers)
decoder = ort.InferenceSession(f"{bundle}/voxcpm2-audio-decoder.onnx", providers=providers)

# ... see the speech-core OnnxVoxCPM2Tts wrapper for the full AR loop.

For a complete reference implementation see OnnxVoxCPM2Tts in speech-core.
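
The quick start does not list tensor names or shapes. One way to discover them before writing a loop is to ask each session directly. A minimal sketch; `describe_io` is a hypothetical helper that relies only on the `get_inputs()` and `get_outputs()` methods of `InferenceSession`:

```python
def describe_io(session):
    """Collect (name, type, shape) for every input and output of a session."""
    def rows(nodes):
        # Dynamic dimensions appear as strings or None in the shape.
        return [(n.name, n.type, list(n.shape)) for n in nodes]
    return {"inputs": rows(session.get_inputs()),
            "outputs": rows(session.get_outputs())}
```

Calling `describe_io(prefill)` on the session from the quick start would list the prefill graph's tensors; the per-graph shapes are also recorded in `config.json`.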

License

Apache 2.0, inherited from upstream openbmb/VoxCPM2. Apache 2.0 covers both the weights and any exported derivative; verify against the upstream model card before commercial use.

Citation

@misc{openbmb-voxcpm2,
  author = {OpenBMB},
  title  = {{VoxCPM2}: a 2B-parameter diffusion-autoregressive multilingual TTS},
  year   = {2025},
  howpublished = {\url{https://huggingface.co/openbmb/VoxCPM2}}
}
