# IndicConformer 600M – ONNX (repackaged for Vernacula)

This repo republishes AI4Bharat's 22-language ai4bharat/indic-conformer-600m-multilingual as a single self-contained ONNX package, laid out exactly as the Vernacula desktop ASR app expects its on-disk model directories. Only the CTC head is included; the RNNT components from the source repo are not shipped here.

All numerical behavior is identical to the upstream encoder+CTC graph; only the on-disk packaging differs.

## Contents

| File | Purpose |
| --- | --- |
| `encoder-model.onnx` (+ `.data` sidecar) | Conformer encoder, `[features, features_lens] -> [encoded, encoded_lens]` |
| `ctc_decoder-model.onnx` | Single Conv1d producing 5633-dim logits (22 × 256 language tokens + 1 shared CTC blank at id 5632) |
| `nemo128.onnx` | DFT-conv1d 80-mel preprocessor, `[waveforms, waveforms_lens] -> [features, features_lens]` |
| `vocab.txt` | Flat 5632-line vocab, id = line index; the shared CTC blank is implicit at id 5632 |
| `language_spans.json` | 22 × `{start, length}` entries: which slice of `vocab.txt` each language's 256 tokens occupy |
| `config.json` | Preprocessor frontend params + CTC blank id |
| `manifest.json` | Per-file MD5 hashes (used by Vernacula's download verifier) |
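As a rough illustration of how a downloader could use `manifest.json`, here is a minimal MD5-verification sketch in Python. The manifest is assumed to map file names to hex digests; this is not Vernacula's actual verifier code.

```python
import hashlib
import os

def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through MD5 so large .onnx.data sidecars never load fully into RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_package(model_dir: str, manifest: dict[str, str]) -> list[str]:
    """Return the names of files whose on-disk MD5 does not match the manifest."""
    return [name for name, digest in manifest.items()
            if md5_of(os.path.join(model_dir, name)) != digest]
```

An empty return value means every listed file hashed clean.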

## Transformations applied vs upstream

  1. Encoder ONNX: consolidated the ~360 per-tensor external-data blob files (HF's xet layout in the upstream repo) into a single encoder-model.onnx.data sidecar so the file set stays manageable. External data referenced from Constant-node attributes, not just graph initializers, was resolved as well.
  2. Renamed ONNX IO tensors so one C# backend loads either this 600M package or a NeMo-fork 120M export without branching:
    • Encoder: audio_signal → features, length → features_lens, outputs → encoded, encoded_lengths → encoded_lens.
    • CTC decoder: encoder_output → encoded, logprobs → logits.
  3. Vocab flatten: upstream vocab.json is a 22-key dict with 257 entries each ([<unk>, t1..t256]). Flattened to a single 5632-line vocab.txt keeping <unk> at local index 0 and the 255 real tokens at 1..255 per language. The 257th upstream slot is unused padding mirroring the RNNT head layout; it would never be decoded by the 256-dim CTC softmax.
  4. Masks → spans: upstream language_masks.json holds 22 per-language length-5633 boolean arrays. Verified that each resolves to a contiguous 256-token range, then compressed them to 22 × {start, length} entries.
  5. Preprocessor: upstream ships TorchScript (preprocessor.ts). We replace it with a custom DFT-conv1d ONNX graph (no STFT op; ONNX Runtime's STFT diverges from PyTorch's on current toolchains). The frontend config is byte-identical to upstream: sample_rate 16 kHz, 80 mel, n_fft 512, hop 160, win 400, hann, preemph 0.97, log+add guard, per-feature normalize, power spectrogram.
  6. Parity verified end-to-end against upstream on a Hindi Fleurs clip: it decodes to readable Devanagari with realistic WER, and the full pipeline (nemo128 → encoder → ctc_decoder) is numerically equivalent to running AI4Bharat's reference model_onnx.py against the original assets/*.onnx files.
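The vocab flattening in step 3 can be sketched as follows. This is a hypothetical reconstruction, assuming the upstream vocab.json maps each language code to a 257-entry list `[<unk>, t1..t256]` whose last slot is the unused RNNT padding:

```python
def flatten_vocab(vocab: dict[str, list[str]]) -> list[str]:
    """Flatten the per-language vocab dict to one list where id = line index.

    Each language contributes 256 entries: <unk> at local index 0 plus the
    255 real tokens; the 257th upstream slot (RNNT padding) is dropped.
    """
    flat = []
    for lang in vocab:  # insertion order defines each language's span
        entries = vocab[lang]
        assert entries[0] == "<unk>" and len(entries) == 257
        flat.extend(entries[:256])  # keep <unk> + 255 tokens, drop padding
    return flat
```

Writing the result one token per line yields the 5632-line vocab.txt (22 × 256).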
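The mask-to-span compression in step 4 amounts to the following check. This is an illustrative reimplementation, assuming each upstream mask is a length-5633 boolean list, not the actual conversion script:

```python
def mask_to_span(mask: list[bool]) -> dict[str, int]:
    """Verify a boolean vocab mask selects one contiguous run of ids and
    compress it to a {start, length} span; raise if it is not contiguous."""
    true_ids = [i for i, on in enumerate(mask) if on]
    if not true_ids:
        raise ValueError("mask selects no tokens")
    start, length = true_ids[0], len(true_ids)
    if true_ids != list(range(start, start + length)):
        raise ValueError("mask is not a contiguous range")
    return {"start": start, "length": length}
```

The contiguity check is what makes the lossless compression to language_spans.json safe.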
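The idea behind the STFT-free preprocessor in step 5 is that a DFT is just a fixed linear map, so it can be expressed as a matmul (or, in ONNX, a conv1d with precomputed cosine/sine kernels). A numpy sketch of the power-spectrogram stage, using the frontend parameters from config.json (this is a conceptual illustration, not the actual nemo128.onnx graph, and it omits the mel projection, log, and normalization):

```python
import numpy as np

N_FFT, HOP, WIN = 512, 160, 400  # frontend params from config.json

def dft_power_frames(x: np.ndarray) -> np.ndarray:
    """Power spectrogram via an explicit real-DFT basis matrix instead of an STFT op."""
    # Pre-emphasis with coefficient 0.97, as in the upstream frontend.
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])
    # Hann window, zero-padded from win_length 400 to n_fft 512.
    win = np.pad(np.hanning(WIN), (0, N_FFT - WIN))
    # One cosine row and one (negated) sine row per rfft bin.
    k = np.arange(N_FFT // 2 + 1)[:, None]
    n = np.arange(N_FFT)[None, :]
    cos_b = np.cos(2 * np.pi * k * n / N_FFT)
    sin_b = -np.sin(2 * np.pi * k * n / N_FFT)
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP : i * HOP + N_FFT] * win for i in range(n_frames)])
    re, im = frames @ cos_b.T, frames @ sin_b.T
    return re**2 + im**2  # (n_frames, n_fft // 2 + 1) power spectrogram
```

Because the basis matrices are plain constants, the same computation exports cleanly to ONNX on any toolchain, sidestepping the ORT/PyTorch STFT divergence.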
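Putting the pieces together, a consumer of this package would restrict decoding to one language's slice of the 5633-dim logits using language_spans.json plus the shared blank. A hedged greedy-CTC sketch (the span values and array shapes are assumptions based on the layout described above; Vernacula's C# decoder is not shown here):

```python
import numpy as np

BLANK_ID = 5632  # shared CTC blank, the last of the 5633 logit rows

def greedy_ctc_decode(logits: np.ndarray, span: dict) -> list[int]:
    """Greedy CTC over one language's slice of the decoder output.

    logits: (time, 5633) array from ctc_decoder-model.onnx.
    span:   one {"start", "length"} entry from language_spans.json.
    Returns global token ids with blanks removed and repeats collapsed.
    """
    s, n = span["start"], span["length"]
    # Restrict the argmax to this language's tokens plus the shared blank.
    allowed = np.concatenate([np.arange(s, s + n), [BLANK_ID]])
    best = allowed[np.argmax(logits[:, allowed], axis=1)]
    out, prev = [], None
    for t in best:
        if t != BLANK_ID and t != prev:
            out.append(int(t))
        prev = t
    return out
```

Mapping the returned ids through vocab.txt (id = line index) yields the transcript pieces.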

## Citation

Original model by AI4Bharat. Please cite their work when using this repackaged copy; see their model card for details.

## License

MIT, same as upstream.
