GGUF + pure-C++ runtime in CrispASR — Gemma-4-E2B (ASR)
#29
by cstr - opened
We've added Gemma-4-E2B to CrispASR as the gemma4-e2b backend. C++ binary, GGUF — no Python, no transformers.
src/gemma4_e2b.cpp — USM Conformer audio encoder (12L, 1024d, chunked-local attention with relative position bias, ClippableLinear with QAT scalars, LightConv1d) + Gemma4 LLM decoder (35L, 1536d, GQA 8Q/1KV, per-layer embeddings, hybrid sliding/full attention with per-layer-type head_dim, GeGLU MLP).
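For readers unfamiliar with two of the decoder terms above, here is a minimal sketch of the GQA head mapping and the GeGLU gate. It is not taken from src/gemma4_e2b.cpp: the function names `kv_head_for`, `gelu_tanh`, and `geglu` are hypothetical, and the tanh-approximated GELU is an assumption about which GELU variant the model uses.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Grouped-query attention: map a query head to the KV head it shares.
// With 8 query heads and 1 KV head, every query head maps to KV head 0.
inline std::size_t kv_head_for(std::size_t q_head, std::size_t n_q, std::size_t n_kv) {
    const std::size_t group = n_q / n_kv;  // query heads per KV head
    return q_head / group;
}

// tanh approximation of GELU (assumed variant, not confirmed by the post).
inline float gelu_tanh(float x) {
    const float k = 0.7978845608f;  // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}

// GeGLU gating: h[i] = gelu(gate[i]) * up[i].
// `gate` and `up` are the outputs of the two input projections of the MLP
// and must have the same length; the down projection is applied to h afterwards.
std::vector<float> geglu(const std::vector<float>& gate, const std::vector<float>& up) {
    std::vector<float> h(gate.size());
    for (std::size_t i = 0; i < gate.size(); ++i) {
        h[i] = gelu_tanh(gate[i]) * up[i];
    }
    return h;
}
```

With a single KV head the K/V cache holds one head per layer regardless of the number of query heads, which is why the mapping collapses to a constant for the 8Q/1KV configuration.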
Two non-obvious bits cost the most time during the port:
- Gemma4ClippableLinear.forward clamps every input AND output of every q/k/v/o/ffw_layer/lconv1d.linear with trained finite bounds (±5..±40). They're not training-time noise: skipping them collapsed audio_layer_11 cos to 0.51 vs HF. Patching HF locally to disable the clamps reproduced our pre-fix cos exactly, which gave us unambiguous attribution. Cos jumped to 0.97 once we wired the clamps in. 480 scalars are persisted per audio tower in the GGUF.
- Audio FE is bit-different from Whisper-style: frame_length=320, not 400; window/hop/normalisation all distinct. Not interchangeable with our shared core/mel.h.
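The clamping behaviour described in the first point can be sketched roughly as follows, together with the kind of cosine-similarity check used to compare a layer's output against a reference. This is an illustration under stated assumptions, not the code in src/gemma4_e2b.cpp: the `ClipBounds` struct, the function names, and the layout of four scalars per linear (input min/max, output min/max) are all hypothetical. The sketch uses plain float weights and needs C++17 for `std::clamp`.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the trained bounds of one clippable linear.
struct ClipBounds {
    float in_min, in_max;    // applied to the input activations
    float out_min, out_max;  // applied to the output activations
};

// y = clamp(W * clamp(x, in bounds), out bounds); W is row-major [n_out x n_in].
std::vector<float> clippable_linear(const std::vector<float>& w,
                                    const std::vector<float>& x,
                                    std::size_t n_out,
                                    const ClipBounds& b) {
    const std::size_t n_in = x.size();
    std::vector<float> xc(n_in);
    for (std::size_t i = 0; i < n_in; ++i) {
        xc[i] = std::clamp(x[i], b.in_min, b.in_max);  // clamp the input first
    }
    std::vector<float> y(n_out, 0.0f);
    for (std::size_t o = 0; o < n_out; ++o) {
        float acc = 0.0f;
        for (std::size_t i = 0; i < n_in; ++i) {
            acc += w[o * n_in + i] * xc[i];
        }
        y[o] = std::clamp(acc, b.out_min, b.out_max);  // then clamp the output
    }
    return y;
}

// Cosine similarity between two activation dumps of equal length,
// e.g. one from this runtime and one from a reference implementation.
double cosine_similarity(const std::vector<float>& a, const std::vector<float>& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        na  += static_cast<double>(a[i]) * a[i];
        nb  += static_cast<double>(b[i]) * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12);  // epsilon avoids 0/0
}
```

Dumping each layer's output and comparing it with `cosine_similarity` against the reference, once with the clamps enabled and once with them disabled, is one way to attribute a divergence to a single missing operation.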
Q4_K transcribes JFK perfectly:
"And so my fellow Americans ask not what your country can do for you, ask what you can do for your country."
Pre-quantised GGUFs (Apache-2.0): cstr/gemma4-e2b-it-GGUF
git clone https://github.com/CrispStrobe/CrispASR && cd CrispASR
cmake -S . -B build && cmake --build build -j8
./build/bin/crispasr --backend gemma4-e2b -m gemma4-e2b-it-q4_k.gguf -f audio.wav -osrt