# IndicConformer 600M – ONNX (repackaged for Vernacula)

This repo republishes AI4Bharat's 22-language ai4bharat/indic-conformer-600m-multilingual as a single self-contained ONNX package, laid out exactly as the Vernacula desktop ASR app expects its on-disk model directories. Only the CTC head is included; the RNNT components from the source repo are not shipped here.

All numerical behavior is identical to the upstream encoder+CTC graph; only the on-disk packaging differs.

## Contents

| File | Purpose |
| --- | --- |
| `encoder-model.onnx` (+ `.data` sidecar) | Conformer encoder, `[features, features_lens] -> [encoded, encoded_lens]` |
| `ctc_decoder-model.onnx` | Single Conv1d producing 5633-dim logits (22 × 256 language tokens + 1 shared CTC blank at id 5632) |
| `nemo128.onnx` | DFT-conv1d 80-mel preprocessor, `[waveforms, waveforms_lens] -> [features, features_lens]` |
| `vocab.txt` | Flat 5632-line vocab, id = line index; the shared CTC blank is implicit at id 5632 |
| `language_spans.json` | 22 × `{start, length}` entries: which slice of `vocab.txt` each language's 256 tokens occupy |
| `config.json` | Preprocessor frontend params + CTC blank id |
| `manifest.json` | Per-file MD5 hashes (used by Vernacula's download verifier) |
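As a rough illustration of how a downloader could use `manifest.json`, here is a minimal MD5-verification sketch in Python. The manifest is assumed to map file names to hex digests; this is not Vernacula's actual verifier code.

```python
import hashlib
import os

def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through MD5 so large .onnx.data sidecars never load fully into RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_package(model_dir: str, manifest: dict[str, str]) -> list[str]:
    """Return the names of files whose on-disk MD5 does not match the manifest."""
    return [name for name, digest in manifest.items()
            if md5_of(os.path.join(model_dir, name)) != digest]
```

An empty return value means every listed file hashed clean.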

## Transformations applied vs upstream

  1. Encoder ONNX: consolidated the ~360 per-tensor external-data blob files (HF's xet layout in the upstream repo) into a single encoder-model.onnx.data sidecar so the file set stays manageable. External data referenced from Constant-node attributes, not just graph initializers, was resolved as well.
  2. Renamed ONNX IO tensors so one C# backend loads either this 600M package or a NeMo-fork 120M export without branching:
    • Encoder: audio_signal → features, length → features_lens, outputs → encoded, encoded_lengths → encoded_lens.
    • CTC decoder: encoder_output → encoded, logprobs → logits.
  3. Vocab flatten: upstream vocab.json is a 22-key dict with 257 entries each ([<unk>, t1..t256]). Flattened to a single 5632-line vocab.txt keeping <unk> at local index 0 and the 255 real tokens at 1..255 per language. The 257th upstream slot is unused padding mirroring the RNNT head layout; it would never be decoded by the 256-dim CTC softmax.
  4. Masks → spans: upstream language_masks.json holds 22 per-language length-5633 boolean arrays. Verified that each resolves to a contiguous 256-token range, then compressed them to 22 × {start, length} entries.
  5. Preprocessor: upstream ships TorchScript (preprocessor.ts). We replace it with a custom DFT-conv1d ONNX graph (no STFT op; ONNX Runtime's STFT diverges from PyTorch's on current toolchains). The frontend config is byte-identical to upstream: sample_rate 16 kHz, 80 mel, n_fft 512, hop 160, win 400, hann, preemph 0.97, log+add guard, per-feature normalize, power spectrogram.
  6. Parity verified end-to-end against upstream on a Hindi Fleurs clip: it decodes to readable Devanagari with realistic WER, and the full pipeline (nemo128 → encoder → ctc_decoder) is numerically equivalent to running AI4Bharat's reference model_onnx.py against the original assets/*.onnx files.
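The vocab flattening in step 3 can be sketched as follows. This is a hypothetical reconstruction, assuming the upstream vocab.json maps each language code to a 257-entry list `[<unk>, t1..t256]` whose last slot is the unused RNNT padding:

```python
def flatten_vocab(vocab: dict[str, list[str]]) -> list[str]:
    """Flatten the per-language vocab dict to one list where id = line index.

    Each language contributes 256 entries: <unk> at local index 0 plus the
    255 real tokens; the 257th upstream slot (RNNT padding) is dropped.
    """
    flat = []
    for lang in vocab:  # insertion order defines each language's span
        entries = vocab[lang]
        assert entries[0] == "<unk>" and len(entries) == 257
        flat.extend(entries[:256])  # keep <unk> + 255 tokens, drop padding
    return flat
```

Writing the result one token per line yields the 5632-line vocab.txt (22 × 256).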
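The mask-to-span compression in step 4 amounts to the following check. This is an illustrative reimplementation, assuming each upstream mask is a length-5633 boolean list, not the actual conversion script:

```python
def mask_to_span(mask: list[bool]) -> dict[str, int]:
    """Verify a boolean vocab mask selects one contiguous run of ids and
    compress it to a {start, length} span; raise if it is not contiguous."""
    true_ids = [i for i, on in enumerate(mask) if on]
    if not true_ids:
        raise ValueError("mask selects no tokens")
    start, length = true_ids[0], len(true_ids)
    if true_ids != list(range(start, start + length)):
        raise ValueError("mask is not a contiguous range")
    return {"start": start, "length": length}
```

The contiguity check is what makes the lossless compression to language_spans.json safe.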
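The idea behind the STFT-free preprocessor in step 5 is that a DFT is just a fixed linear map, so it can be expressed as a matmul (or, in ONNX, a conv1d with precomputed cosine/sine kernels). A numpy sketch of the power-spectrogram stage, using the frontend parameters from config.json (this is a conceptual illustration, not the actual nemo128.onnx graph, and it omits the mel projection, log, and normalization):

```python
import numpy as np

N_FFT, HOP, WIN = 512, 160, 400  # frontend params from config.json

def dft_power_frames(x: np.ndarray) -> np.ndarray:
    """Power spectrogram via an explicit real-DFT basis matrix instead of an STFT op."""
    # Pre-emphasis with coefficient 0.97, as in the upstream frontend.
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])
    # Hann window, zero-padded from win_length 400 to n_fft 512.
    win = np.pad(np.hanning(WIN), (0, N_FFT - WIN))
    # One cosine row and one (negated) sine row per rfft bin.
    k = np.arange(N_FFT // 2 + 1)[:, None]
    n = np.arange(N_FFT)[None, :]
    cos_b = np.cos(2 * np.pi * k * n / N_FFT)
    sin_b = -np.sin(2 * np.pi * k * n / N_FFT)
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP : i * HOP + N_FFT] * win for i in range(n_frames)])
    re, im = frames @ cos_b.T, frames @ sin_b.T
    return re**2 + im**2  # (n_frames, n_fft // 2 + 1) power spectrogram
```

Because the basis matrices are plain constants, the same computation exports cleanly to ONNX on any toolchain, sidestepping the ORT/PyTorch STFT divergence.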
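Putting the pieces together, a consumer of this package would restrict decoding to one language's slice of the 5633-dim logits using language_spans.json plus the shared blank. A hedged greedy-CTC sketch (the span values and array shapes are assumptions based on the layout described above; Vernacula's C# decoder is not shown here):

```python
import numpy as np

BLANK_ID = 5632  # shared CTC blank, the last of the 5633 logit rows

def greedy_ctc_decode(logits: np.ndarray, span: dict) -> list[int]:
    """Greedy CTC over one language's slice of the decoder output.

    logits: (time, 5633) array from ctc_decoder-model.onnx.
    span:   one {"start", "length"} entry from language_spans.json.
    Returns global token ids with blanks removed and repeats collapsed.
    """
    s, n = span["start"], span["length"]
    # Restrict the argmax to this language's tokens plus the shared blank.
    allowed = np.concatenate([np.arange(s, s + n), [BLANK_ID]])
    best = allowed[np.argmax(logits[:, allowed], axis=1)]
    out, prev = [], None
    for t in best:
        if t != BLANK_ID and t != prev:
            out.append(int(t))
        prev = t
    return out
```

Mapping the returned ids through vocab.txt (id = line index) yields the transcript pieces.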

## Citation

Original model by AI4Bharat. Please cite their work when using this repackaged copy; see their model card for details.

## License

MIT, same as upstream.
