ProtocolVoice models

Offline models for the ProtocolVoice Android app — voice transcription, speaker diarization, and on-device interview summarization.

All models run on the device; the app makes no cloud calls.

Russian ASR

File	Size	Purpose	Original source	License
`gigaam_v3_e2e_ctc_int8.onnx`	305 MB	Russian ASR with built-in punctuation	Sber/SaluteDevices GigaAM (v3, e2e CTC, int8-quantized)	MIT

English ASR

File	Size	Purpose	Original source	License
`en/whisper_base_en_encoder_int8.onnx`	28 MB	Whisper base.en encoder	openai/whisper via sherpa-onnx	MIT
`en/whisper_base_en_decoder_int8.onnx`	125 MB	Whisper base.en decoder	OpenAI Whisper via sherpa-onnx	MIT
`en/whisper_base_en_tokens.txt`	0.8 MB	Whisper tokens vocab	OpenAI Whisper	MIT

Speaker diarization (works for any language)

File	Size	Purpose	Original source	License
`speaker_embedding_camplus.onnx`	27 MB	Speaker embedding (CAM++) — recommended default	modelscope/3D-Speaker	Apache-2.0
`speaker_embedding.onnx`	111 MB	Speaker embedding (ERes2Net V1) — best quality	modelscope/3D-Speaker	Apache-2.0
`speaker_embedding_v2.onnx`	68 MB	Speaker embedding (ERes2NetV2)	modelscope/3D-Speaker	Apache-2.0

Russian summarization (Default tier — NER-based, no LLM)

File	Size	Purpose	Original source	License
`summary/navec_news.tar`	25 MB	Navec quantized word embeddings (250K Russian words, 300-dim, PQ-100)	natasha/navec	MIT
`summary/slovnet_ner.tar`	2.3 MB	Slovnet NER weights (WordCNN + CRF, PER/LOC/ORG)	natasha/slovnet	MIT
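
The "PQ-100" note for navec_news.tar refers to product quantization: each 300-dimensional vector is split into 100 sub-vectors of 3 values, and each sub-vector is stored as the index of a centroid in a small codebook. The sketch below only illustrates the idea. The codebook size of 256 (one byte per code) is an assumption, the codebook values are random stand-ins rather than Navec data, and `decode` is a hypothetical helper, not part of the navec API.

```python
import random

DIM = 300                     # embedding width, from the table above
SUBVECTORS = 100              # the "100" in PQ-100
SUB_DIM = DIM // SUBVECTORS   # 3 floats per sub-vector
CENTROIDS = 256               # assumption: one uint8 code per sub-vector


def decode(codes, codebooks):
    """Rebuild a dense vector from product-quantization codes.

    codes:     SUBVECTORS integers in [0, CENTROIDS)
    codebooks: codebooks[i][c] is centroid c (SUB_DIM floats) of sub-vector i
    """
    vector = []
    for i, code in enumerate(codes):
        vector.extend(codebooks[i][code])
    return vector


# Random stand-in codebooks, used only to exercise the shapes.
rng = random.Random(0)
codebooks = [
    [[rng.random() for _ in range(SUB_DIM)] for _ in range(CENTROIDS)]
    for _ in range(SUBVECTORS)
]
codes = [rng.randrange(CENTROIDS) for _ in range(SUBVECTORS)]
vector = decode(codes, codebooks)

# Storage estimate for 250K words: one byte per code vs. float32 per value.
words = 250_000
pq_bytes = words * SUBVECTORS      # 25,000,000 bytes of codes
dense_bytes = words * DIM * 4      # 300,000,000 bytes as dense float32
print(len(vector), pq_bytes / 1e6, dense_bytes / 1e6)
```

Under that one-byte-per-code assumption the codes alone take about 25 MB, which is consistent with the size listed for navec_news.tar; the same vocabulary stored as dense float32 vectors would need about 300 MB.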

These two files together (28 MB total) enable offline Russian named entity recognition + LexRank-based extractive summarization. ProtocolVoice uses them to extract names, organizations, locations, and key quotes from interview transcripts. No LLM required — fully deterministic, factual extraction.
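
To make "LexRank-based extractive summarization" concrete, here is a minimal sketch of the continuous LexRank variant: sentences become TF-IDF vectors, pairwise cosine similarity defines a graph, and a damped power iteration scores each sentence by centrality. It is not the app's Kotlin implementation. The regex tokenizer, the smoothed IDF formula, the damping factor, and the function names are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tfidf_vectors(sentences):
    """Return one {term: weight} dict per sentence (each sentence is a document)."""
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(sentences)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF keeps every weight positive, even for common terms.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def lexrank(sentences, damping=0.85, iterations=50):
    """Return one centrality score per sentence; the scores sum to 1."""
    n = len(sentences)
    if n == 0:
        return []
    vectors = tfidf_vectors(sentences)
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    # Row-normalise into transition probabilities; a sentence with no
    # similar neighbours jumps uniformly to any sentence.
    rows = []
    for row in sim:
        total = sum(row)
        rows.append([v / total for v in row] if total else [1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n
                  + damping * sum(scores[i] * rows[i][j] for i in range(n))
                  for j in range(n)]
    return scores


sentences = [
    "The committee approved the budget for the new clinic.",
    "The budget for the clinic was approved after a long debate.",
    "Lunch was served at noon.",
]
scores = lexrank(sentences)
ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
print([sentences[i] for i in ranked[:2]])
```

In this toy input the two budget sentences reinforce each other and outrank the unrelated third one. An extractive summary would keep the top-ranked sentences verbatim, which is why the output stays factual.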

Manifest

File	Size	Purpose
`manifest.json`	< 2 KB	SHA-256 hashes and metadata for all models

Important — attribution

These are NOT new models — this repository redistributes existing models in formats convenient for mobile delivery. The original authors retain all credit and copyright. We did not train, fine-tune, or modify the model weights.

Please cite the original projects, not this redistribution:

GigaAM-v3 (Russian ASR): Sber AI, SaluteDevices — https://github.com/salute-developers/GigaAM
Whisper (English ASR): OpenAI — https://github.com/openai/whisper
3D-Speaker (CAM++, ERes2Net, ERes2NetV2): ModelScope, Alibaba — https://github.com/modelscope/3D-Speaker
Slovnet NER + Navec: Natasha project, Alexander Kukushkin — https://github.com/natasha/slovnet, https://github.com/natasha/navec
sherpa-onnx (ONNX runtime): Next-gen Kaldi (k2-fsa) — https://github.com/k2-fsa/sherpa-onnx

Why this redistribution

The ProtocolVoice mobile app needs to download these models on first run from a mirror that:

supports files larger than 100 MB without git-lfs limits,
has a fast CDN reachable from Russia,
is the conventional hosting platform for ML models.

All redistributed files retain their original licenses. This README serves as the required attribution under those licenses.

How the app uses these models

ASR + diarization (loaded via sherpa-onnx):

App downloads .onnx files from https://huggingface.co/protocolvoice/asr-models/resolve/main/{filename}
Verifies SHA-256 against manifest.json
Loads via sherpa-onnx for offline inference
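
The download and verification steps above can be sketched in Python as follows. This is an illustration of the flow, not the app's Kotlin code: `model_url` and `save_verified` are hypothetical helper names, and the expected hash is assumed to have been read from manifest.json beforehand. The file is hashed while it streams to a temporary file and is moved into place only if the digest matches, so a corrupt or truncated download never appears under the final name.

```python
import hashlib
import os
import tempfile

# URL template taken from the step above.
BASE_URL = "https://huggingface.co/protocolvoice/asr-models/resolve/main/"


def model_url(filename):
    """Build the download URL for one file in the repository."""
    return BASE_URL + filename


def save_verified(stream, dest_path, expected_sha256, chunk_size=1 << 20):
    """Copy a binary stream to dest_path, keeping it only if SHA-256 matches."""
    digest = hashlib.sha256()
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as out:
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break
                digest.update(chunk)   # hash while writing, one pass over the data
                out.write(chunk)
        if digest.hexdigest() != expected_sha256.lower():
            raise ValueError(f"SHA-256 mismatch for {dest_path}")
        os.replace(tmp_path, dest_path)  # atomic rename within one directory
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Usage sketch (needs network access, so it is left commented out):
# import urllib.request
# name = "speaker_embedding_camplus.onnx"
# with urllib.request.urlopen(model_url(name)) as response:
#     save_verified(response, name, expected_sha256_from_manifest)
```

Any object with a `read(size)` method works as `stream`, which also makes the function easy to test with an in-memory buffer.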

Summarization (Default tier, custom Kotlin port):

App downloads summary/navec_news.tar and summary/slovnet_ner.tar
Extracts both .tar archives into the app's private files directory
Loads weights into a pure-Kotlin reimplementation of Slovnet NER (no PyTorch, no Python — just FloatArray math): WordEmbedding → ShapeEmbedding → 3-layer Conv1D → Linear → CRF Viterbi
Combines NER output with TF-IDF + LexRank to extract top quotes, named entities, risks, and numerical data
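
The last stage of the NER pipeline above, CRF Viterbi decoding, fits in a few lines of plain array math. The sketch below is in Python for readability and is not the app's Kotlin port. The three-tag BIO set and all scores are made-up illustration values, not Slovnet weights.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag index sequence for a linear-chain CRF.

    emissions[t][k]:   score of tag k at token t (e.g. output of the Linear layer)
    transitions[j][k]: score of moving from tag j to tag k
    """
    if not emissions:
        return []
    n_tags = len(emissions[0])
    scores = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        step_scores, step_back = [], []
        for k in range(n_tags):
            # Best previous tag for ending in tag k at this token.
            best_prev = max(range(n_tags),
                            key=lambda j: scores[j] + transitions[j][k])
            step_scores.append(scores[best_prev] + transitions[best_prev][k]
                               + emission[k])
            step_back.append(best_prev)
        scores = step_scores
        backpointers.append(step_back)
    # Follow the backpointers from the best final tag to the first token.
    path = [max(range(n_tags), key=lambda k: scores[k])]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return path


tags = ["O", "B-PER", "I-PER"]          # illustrative tag set
transitions = [
    [0.0, 0.0, -1e4],                   # O -> I-PER is effectively forbidden
    [0.0, 0.0, 1.0],                    # B-PER -> I-PER is encouraged
    [0.0, 0.0, 0.5],
]
emissions = [
    [0.1, 2.0, 0.0],                    # token 0: looks like the start of a name
    [0.5, 0.0, 0.4],                    # token 1: ambiguous on its own
    [2.0, 0.0, 0.0],                    # token 2: clearly outside
]
decoded = [tags[k] for k in viterbi(emissions, transitions)]
print(decoded)  # ['B-PER', 'I-PER', 'O']
```

Taking the best tag per token independently would label the second token "O". The transition scores are what let the decoder keep the two-token name together.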

Inference performance on Xiaomi 12T: ~6 seconds for a 17,900-word transcript (default tier, NER + LexRank, no LLM).

You can also use these files directly with the upstream libraries (sherpa-onnx, slovnet, navec) in any project that respects the original licenses.

Verifying integrity

import hashlib

with open("gigaam_v3_e2e_ctc_int8.onnx", "rb") as f:
    print(hashlib.sha256(f.read()).hexdigest())
# expected: 0aacb41f70f0f5aaac4b45dd430337b9e16b180f22c72af04db8516e7609c3c0

Hashes for all files are in manifest.json.
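
To check several files at once without loading each one fully into memory, a chunked variant can be used. The sketch below takes a plain mapping from relative filename to hex digest. Building that mapping from manifest.json depends on the manifest's actual layout, which is not described here, so the `expected` argument and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large models are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected, root="."):
    """Return the filenames that are missing or whose SHA-256 differs.

    expected: {relative filename: hex SHA-256 digest}
    """
    bad = []
    for name, want in expected.items():
        path = Path(root) / name
        if not path.is_file() or sha256_file(path) != want.lower():
            bad.append(name)
    return bad
```

An empty list from `find_mismatches` means every listed file is present and intact.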

Optional: Pro tier (QVikhr 1.5B)

ProtocolVoice has an optional Pro tier that produces a literary, narrative summary using QVikhr-2.5-1.5B-Instruct-r (1.0 GB GGUF, runs via llama.cpp on-device). The Pro tier is layered on top of the Default tier: Default extracts the facts, and Pro turns them into a coherent narrative.

The QVikhr GGUF is not hosted in this repo — users download it directly from the Vikhrmodels HF org or from a separate mirror, on demand. The QVikhr authors retain copyright; please cite them, not us.

License

This repository's metadata, README, and packaging scripts are released under Apache-2.0. Each model file remains under its original license (see the tables above). By using a model, you accept its original license — not just this repository's.

Removal request

If you are an author of one of the upstream projects and have any concerns about this redistribution (attribution, hosting, anything else), please open a discussion on this Hugging Face repo or email the maintainers — the files will be amended or removed as requested.
