VoxCPM2: ONNX (INT8)

2 B-parameter multilingual TTS with voice cloning and voice design. 48 kHz output.

Part of the soniqo.audio speech toolkit, an open, runtime-portable stack for speech AI. This bundle is the ONNX Runtime export, designed to plug into the abstract interfaces in speech-core (OnnxVoxCPM2Tts). Browse all ONNX bundles in the soniqo ONNX collection.

Use cases on soniqo.audio

ONNX export of openbmb/VoxCPM2, a 2 B-parameter diffusion-autoregressive TTS with 48 kHz studio-quality output, reference-audio voice cloning, and natural-language voice design. It is a drop-in replacement for the LiteRT bundle on the synth worker side: same graph topology, same I/O contracts. It runs on the ONNX Runtime CPU EP today; the CUDA EP is already wired into the wrapper for a GPU swap.

Why ONNX

This export targets ONNX Runtime as a complement to the LiteRT bundle. Both use the same four-graph split; on a CPU-only workload ONNX Runtime gives us:

  • ~28 % lower peak RSS during inference (8.2 GiB vs 11.5 GiB after load, 9.3 GiB vs 13.0 GiB peak; measured on a Mac CPU, same prompt, same step count). On a memory-constrained synth pod, that is the difference between fitting and not fitting.
  • ~2.4× lower per-step latency (110 ms vs 266 ms per AR step on the same hardware): the XNNPACK INT8 path in ORT 1.26 is more aggressive about constant-folding the dequant.
  • A clean path to GPU acceleration via the CUDA EP without re-exporting the bundle.
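
In practice the GPU swap in the last bullet comes down to the provider list handed to each session. A minimal sketch, assuming a CUDA-enabled ONNX Runtime build is installed; `pick_providers` is a hypothetical helper name and is not taken from speech-core:

```python
def pick_providers(available, prefer_gpu=True):
    """Build an ORT provider list: CUDA first when present, CPU as fallback.

    `available` is the list returned by onnxruntime.get_available_providers().
    """
    providers = []
    if prefer_gpu and "CUDAExecutionProvider" in available:
        providers.append("CUDAExecutionProvider")
    # Keep the CPU EP last so nodes the GPU EP cannot take still run.
    providers.append("CPUExecutionProvider")
    return providers
```

Pass the result as `providers=` when constructing each `ort.InferenceSession`, for example `pick_providers(ort.get_available_providers())`.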

Pipeline

VoxCPM2 is not a single feed-forward model. The runtime loop is

text + optional instruction ──► text-prefill
                                      │
                                      ▼
                              repeated token-step  (KV cache rolls per step)
                                      │
                                      ▼
                              audio-decoder ──► 48 kHz PCM

The host owns the loop and the KV cache; ONNX owns the static tensor programs. Same split as the LiteRT bundle in this collection: same host-side wrapper code, just a different runtime backend.
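
The host-side control flow can be sketched independently of the runtime. The three callables below, their signatures, the `done` stop flag, and the `max_steps` cap are all assumptions made for illustration; the real tensor names and stop condition live in the OnnxVoxCPM2Tts wrapper:

```python
def synthesize(prefill, step, decode, text_ids, max_steps=2000):
    """Drive the prefill -> token-step -> audio-decoder loop on the host.

    Hypothetical stand-ins for the real session.run calls:
      prefill(text_ids)     -> (state, kv_cache)
      step(state, kv_cache) -> (latent, state, kv_cache, done)
      decode(latents)       -> PCM samples
    """
    state, kv_cache = prefill(text_ids)
    latents = []
    for _ in range(max_steps):
        # The host threads the rolled KV cache back into the next step.
        latent, state, kv_cache, done = step(state, kv_cache)
        latents.append(latent)
        if done:
            break
    return decode(latents)
```

Keeping the loop in host code is what lets the same wrapper drive either runtime backend: only the three callables change.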

Files

| File | Size | Description |
|---|---|---|
| voxcpm2-text-prefill.onnx + .onnx.data | 2.4 GB | INT8 weight-only text + instruction prefill (MiniCPM-4 KV-cache producer). max_text_tokens = 512. |
| voxcpm2-token-step.onnx + .onnx.data | 2.8 GB | INT8 weight-only autoregressive step (MiniCPM-4 + residual LM, KV-cache in/out, CFM Euler decoder). |
| voxcpm2-audio-encoder.onnx | 183 MB | FP32 reference-audio encoder (16 kHz @ 6.4 s → 40 latent frames, voice-cloning only). |
| voxcpm2-audio-decoder.onnx | 175 MB | FP32 AudioVAE decoder (acoustic tokens → 48 kHz PCM, 10.24 s window). |
| tokenizer.json / tokenizer_config.json / special_tokens_map.json | - | HF tokenizer bundle. |
| generation_config.json / tokenization_voxcpm2.py | - | Generation defaults + tokenizer module. |
| config.json | - | Model config (architecture, dims, IO shapes per graph). |
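
The window figures in the Files listing translate directly into buffer sizes. A small worked calculation, assuming mono audio and fixed-shape windows (neither is stated explicitly in this card; config.json is the authority for the actual IO shapes):

```python
# Figures taken from the Files listing; durations are in milliseconds so the
# arithmetic stays in integers.
ENCODER_SAMPLE_RATE = 16_000   # Hz, reference-audio encoder input
ENCODER_WINDOW_MS = 6_400      # 6.4 s reference clip
ENCODER_FRAMES = 40            # latent frames produced per clip
DECODER_SAMPLE_RATE = 48_000   # Hz, AudioVAE decoder output
DECODER_WINDOW_MS = 10_240     # 10.24 s decoder window

encoder_samples = ENCODER_SAMPLE_RATE * ENCODER_WINDOW_MS // 1000   # samples per clip
decoder_samples = DECODER_SAMPLE_RATE * DECODER_WINDOW_MS // 1000   # samples per window
encoder_frame_ms = ENCODER_WINDOW_MS / ENCODER_FRAMES               # audio covered per latent frame
```

This gives 102,400 input samples per reference clip, 491,520 output samples per decoder window, and 160 ms of reference audio per latent frame.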

Quantization recipe: onnxruntime.quantization.quantize_dynamic with weight_type=QuantType.QInt8, op_types_to_quantize=["MatMul", "Gemm"]. Activations stay FP32. AudioVAE stays FP32 (Conv-heavy; dynamic INT8 rejects Conv axis remapping, the same lesson as Parakeet's decoder-joint).
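
The recipe maps onto a single call. A minimal sketch, not the exact export script: `quantize_weights_only` and its path arguments are hypothetical, and `use_external_data_format=True` is an assumption based on the graphs exceeding 2 GB:

```python
# Conv is deliberately absent, so Conv-heavy graphs are left in FP32.
QUANT_OP_TYPES = ["MatMul", "Gemm"]


def quantize_weights_only(src_path, dst_path):
    """Apply dynamic INT8 weight quantization to one exported graph."""
    # Imported lazily so this module loads even without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=src_path,
        model_output=dst_path,
        op_types_to_quantize=QUANT_OP_TYPES,
        weight_type=QuantType.QInt8,
        # Assumption: weights are written to a sidecar because of the 2 GB cap.
        use_external_data_format=True,
    )
```

Restricting `op_types_to_quantize` is what keeps the quantizer away from Conv nodes; the audio encoder and decoder are simply not passed through this step.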

The .onnx.data files are external-data sidecars (the production weights exceed the 2 GB protobuf serialization cap). ORT's InferenceSession auto-resolves them from the protobuf's external_data references with no special SessionOptions, provided each sidecar sits in the same directory as its .onnx file.
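
A partial download that misses a sidecar only fails once the session is created, so a pre-flight check can save a confusing error. A sketch under the assumption that the sidecars follow the `<graph>.onnx.data` naming shown in the Files listing; `missing_sidecars` is a hypothetical helper:

```python
from pathlib import Path

# Graphs that the Files listing pairs with an .onnx.data sidecar.
GRAPHS_WITH_SIDECAR = ["voxcpm2-text-prefill.onnx", "voxcpm2-token-step.onnx"]


def missing_sidecars(bundle_dir):
    """Return the names of expected sidecar files absent from bundle_dir."""
    bundle_dir = Path(bundle_dir)
    missing = []
    for graph in GRAPHS_WITH_SIDECAR:
        sidecar = bundle_dir / (graph + ".data")
        if not sidecar.is_file():
            missing.append(sidecar.name)
    return missing
```

An empty list means both large graphs have their weights next to them and can be loaded directly.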

Quick start (Python)

import onnxruntime as ort
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

# InferenceSession needs local file paths, not a Hub repo id, so fetch the
# bundle first (snapshot_download reuses the local cache when it can).
bundle = snapshot_download("soniqo/VoxCPM2-ONNX")
providers = ["CPUExecutionProvider"]

tokenizer = AutoTokenizer.from_pretrained(bundle, trust_remote_code=True)
prefill = ort.InferenceSession(f"{bundle}/voxcpm2-text-prefill.onnx", providers=providers)
step    = ort.InferenceSession(f"{bundle}/voxcpm2-token-step.onnx", providers=providers)
encoder = ort.InferenceSession(f"{bundle}/voxcpm2-audio-encoder.onnx", providers=providers)
decoder = ort.InferenceSession(f"{bundle}/voxcpm2-audio-decoder.onnx", providers=providers)

# ... see the speech-core OnnxVoxCPM2Tts wrapper for the full AR loop.

For a complete reference implementation see OnnxVoxCPM2Tts in speech-core.
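
The quick start creates the sessions but does not list their tensor names. Before writing a custom loop it can help to dump each graph's I/O contract. A small sketch; `describe_io` is a hypothetical helper that relies only on the `get_inputs()` and `get_outputs()` methods of `InferenceSession`:

```python
def describe_io(session):
    """Summarize a session's inputs and outputs as (name, shape, type) tuples.

    Works with any object exposing get_inputs()/get_outputs(), such as an
    onnxruntime.InferenceSession.
    """
    def rows(args):
        return [(arg.name, arg.shape, arg.type) for arg in args]

    return {
        "inputs": rows(session.get_inputs()),
        "outputs": rows(session.get_outputs()),
    }
```

Calling `describe_io(prefill)` and `describe_io(step)` shows which outputs of one graph have to be fed back as inputs of the next.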

License

Apache 2.0, inherited from upstream openbmb/VoxCPM2. Apache 2.0 covers both the weights and any exported derivative; verify against the upstream model card before commercial use.

Citation

@misc{openbmb-voxcpm2,
  author = {OpenBMB},
  title  = {{VoxCPM2}: a 2B-parameter diffusion-autoregressive multilingual TTS},
  year   = {2025},
  howpublished = {\url{https://huggingface.co/openbmb/VoxCPM2}}
}