ProtocolVoice models
Offline models for the ProtocolVoice Android app: voice transcription, speaker diarization, and on-device interview summarization.
All models run on the device, no cloud calls.
Contents
Russian ASR
| File | Size | Purpose | Original source | License |
|---|---|---|---|---|
| `gigaam_v3_e2e_ctc_int8.onnx` | 305 MB | Russian ASR with built-in punctuation | Sber/SaluteDevices GigaAM (v3, e2e CTC, int8-quantized) | MIT |
English ASR
| File | Size | Purpose | Original source | License |
|---|---|---|---|---|
| `en/whisper_base_en_encoder_int8.onnx` | 28 MB | Whisper base.en encoder | openai/whisper via sherpa-onnx | MIT |
| `en/whisper_base_en_decoder_int8.onnx` | 125 MB | Whisper base.en decoder | OpenAI Whisper via sherpa-onnx | MIT |
| `en/whisper_base_en_tokens.txt` | 0.8 MB | Whisper tokens vocab | OpenAI Whisper | MIT |
Speaker diarization (works for any language)
| File | Size | Purpose | Original source | License |
|---|---|---|---|---|
| `speaker_embedding_camplus.onnx` | 27 MB | Speaker embedding (CAM++), recommended default | modelscope/3D-Speaker | Apache-2.0 |
| `speaker_embedding.onnx` | 111 MB | Speaker embedding (ERes2Net V1), best quality | modelscope/3D-Speaker | Apache-2.0 |
| `speaker_embedding_v2.onnx` | 68 MB | Speaker embedding (ERes2NetV2) | modelscope/3D-Speaker | Apache-2.0 |
Russian summarization (Default tier: NER-based, no LLM)
| File | Size | Purpose | Original source | License |
|---|---|---|---|---|
| `summary/navec_news.tar` | 25 MB | Navec quantized word embeddings (250K Russian words, 300-dim, PQ-100) | natasha/navec | MIT |
| `summary/slovnet_ner.tar` | 2.3 MB | Slovnet NER weights (WordCNN + CRF, PER/LOC/ORG) | natasha/slovnet | MIT |
These two files together (28 MB total) enable offline Russian named entity recognition and LexRank-based extractive summarization. ProtocolVoice uses them to extract names, organizations, locations, and key quotes from interview transcripts. No LLM is required: the extraction is fully deterministic and factual.
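To illustrate the LexRank-based extractive step, here is a minimal Python sketch. It is not the app's Kotlin code: the smoothed IDF formula, the similarity threshold of 0.1, the damping factor of 0.85, and the function names are all assumptions chosen for illustration. Sentences are expected as lists of already-tokenized words.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Return one sparse TF-IDF vector (a dict) per tokenized sentence."""
    n = len(sentences)
    df = Counter(word for sent in sentences for word in set(sent))
    # Smoothed IDF (an assumption): words found in every sentence keep a small weight.
    idf = {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}
    return [{w: tf * idf[w] for w, tf in Counter(sent).items()} for sent in sentences]


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def lexrank(sentences, threshold=0.1, damping=0.85, iterations=50):
    """Score sentences by PageRank-style centrality on a similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    vecs = tfidf_vectors(sentences)
    # An edge i-j exists when the two sentences are similar enough.
    adj = [[1.0 if i != j and cosine(vecs[i], vecs[j]) >= threshold else 0.0
            for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score divided by its degree.
            rank = sum(adj[j][i] * scores[j] / sum(adj[j]) for j in range(n) if adj[j][i])
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores
```

The highest-scoring sentences are the candidates for "key quotes". A sentence that shares vocabulary with many others ranks high, while an isolated sentence keeps only the baseline score.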
Manifest
| File | Size | Purpose |
|---|---|---|
| `manifest.json` | < 2 KB | SHA-256 hashes and metadata for all models |
Important: attribution
These are NOT new models: this repository redistributes existing models in formats convenient for mobile delivery. The original authors retain all credit and copyright. We did not train, fine-tune, or modify the model weights.
Please cite the original projects, not this redistribution:
- GigaAM-v3 (Russian ASR): Sber AI, SaluteDevices - https://github.com/salute-developers/GigaAM
- Whisper (English ASR): OpenAI - https://github.com/openai/whisper
- 3D-Speaker (CAM++, ERes2Net, ERes2NetV2): ModelScope, Alibaba - https://github.com/modelscope/3D-Speaker
- Slovnet NER + Navec: Natasha project, Alexander Kukushkin - https://github.com/natasha/slovnet, https://github.com/natasha/navec
- sherpa-onnx (ONNX runtime): Next-gen Kaldi (k2-fsa) - https://github.com/k2-fsa/sherpa-onnx
Why this redistribution
The ProtocolVoice mobile app needs to download these models on first run from a mirror that:
- supports files larger than 100 MB without git-lfs limits,
- has a fast CDN reachable from Russia,
- is the conventional hosting platform for ML models.
All redistributed files retain their original licenses. This README serves as the required attribution under those licenses.
How the app uses these models
ASR + diarization (loaded via sherpa-onnx):
- App downloads `.onnx` files from `https://huggingface.co/protocolvoice/asr-models/resolve/main/{filename}`
- Verifies SHA-256 against `manifest.json`
- Loads via sherpa-onnx for offline inference
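The download-and-verify steps above can be sketched in Python as follows. This is an illustration rather than the app's Kotlin code: the function names, the `.part` temporary suffix, and the 1 MiB chunk size are assumptions, and only the URL pattern comes from this README.

```python
import hashlib
import os
import urllib.request

BASE_URL = "https://huggingface.co/protocolvoice/asr-models/resolve/main/"


def model_url(filename):
    """Build the download URL for a file in this repository."""
    return BASE_URL + filename


def download_and_verify(url, dest, expected_sha256, chunk_size=1 << 20):
    """Stream url to dest, hashing on the fly; keep the file only if the hash matches."""
    digest = hashlib.sha256()
    part = dest + ".part"  # hypothetical temporary name for the partial download
    with urllib.request.urlopen(url) as response, open(part, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            out.write(chunk)
    if digest.hexdigest() != expected_sha256.lower():
        os.remove(part)
        raise ValueError(f"SHA-256 mismatch for {url}")
    os.replace(part, dest)  # atomic rename: dest never holds a partial file
    return dest
```

Hashing while streaming avoids reading a 305 MB model a second time, and renaming only after the check means an interrupted or corrupted download is never mistaken for a usable model.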
Summarization (Default tier, custom Kotlin port):
- App downloads `summary/navec_news.tar` and `summary/slovnet_ner.tar`
- Extracts both `.tar` archives into the app's private files directory
- Loads weights into a pure-Kotlin reimplementation of Slovnet NER (no PyTorch, no Python, just FloatArray math): WordEmbedding → ShapeEmbedding → 3-layer Conv1D → Linear → CRF Viterbi
- Combines NER output with TF-IDF + LexRank to extract top quotes, named entities, risks, and numerical data
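The last stage of the NER pipeline above, CRF Viterbi decoding, can be sketched in a few lines of Python. This is a generic linear-chain Viterbi decoder, not a port of the app's Kotlin code or of Slovnet itself. The function name and the plain nested-list inputs are assumptions, and start/end transition scores are omitted for brevity.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence for a linear-chain CRF.

    emissions[t][k]: score of tag k at token t (e.g. the Linear layer output).
    transitions[j][k]: score of moving from tag j to tag k.
    """
    if not emissions:
        return []
    n_tags = len(emissions[0])
    score = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for step in emissions[1:]:
        new_score, pointers = [], []
        for k in range(n_tags):
            # Best previous tag for reaching tag k at this token.
            best_prev = max(range(n_tags), key=lambda j: score[j] + transitions[j][k])
            new_score.append(score[best_prev] + transitions[best_prev][k] + step[k])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Walk back from the best final tag to recover the full path.
    tag = max(range(n_tags), key=lambda k: score[k])
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    path.reverse()
    return path
```

The transition scores are what let the CRF reject inconsistent tag sequences: a strongly negative score between two tags keeps the decoder from switching between them even when a single token's emission favours the switch.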
Inference performance on Xiaomi 12T: ~6 seconds for a 17,900-word transcript (default tier, NER + LexRank, no LLM).
You can also use these files directly with the upstream libraries (sherpa-onnx, slovnet, navec) in any project that respects the original licenses.
Verifying integrity
```python
import hashlib

with open("gigaam_v3_e2e_ctc_int8.onnx", "rb") as f:
    print(hashlib.sha256(f.read()).hexdigest())
# expected: 0aacb41f70f0f5aaac4b45dd430337b9e16b180f22c72af04db8516e7609c3c0
```
Hashes for all files are in manifest.json.
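To check every downloaded file in one pass, a sketch along these lines may help. The layout of `manifest.json` is not documented in this README, so the structure assumed here (a top-level `"files"` list of objects with `"name"` and `"sha256"` keys) is hypothetical and should be adapted to the real file. The function names are also illustrative.

```python
import hashlib
import json
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large models are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(manifest_path, model_dir):
    """Return the names of files that are missing or whose hash differs."""
    with open(manifest_path, encoding="utf-8") as f:
        manifest = json.load(f)
    failed = []
    # ASSUMED layout: {"files": [{"name": "...", "sha256": "..."}, ...]}
    for entry in manifest["files"]:
        path = os.path.join(model_dir, entry["name"])
        if not os.path.isfile(path) or sha256_of(path) != entry["sha256"].lower():
            failed.append(entry["name"])
    return failed
```

An empty result from `check_manifest("manifest.json", ".")` would mean every listed file is present and matches its recorded hash.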
Optional: Pro tier (QVikhr 1.5B)
ProtocolVoice has an optional PRO tier that produces a literary, narrative summary using QVikhr-2.5-1.5B-Instruct-r (1.0 GB GGUF, runs via llama.cpp on-device). The PRO tier is layered on top of the Default tier: Default extracts facts, PRO turns them into a coherent narrative.
The QVikhr GGUF is not hosted in this repo; users download it directly from the Vikhrmodels HF org or from a separate mirror, on demand. The QVikhr authors retain copyright; please cite them, not us.
License
This repository's metadata, README, and packaging scripts are released under Apache-2.0. Each model file remains under its original license (see the tables above). By using a model, you accept its original license, not just this repository's.
Removal request
If you are an author of one of the upstream projects and have any concerns about this redistribution (attribution, hosting, anything else), please open a discussion on this Hugging Face repo or email the maintainers; the files will be amended or removed as requested.