GGUF + pure-C++ runtime in CrispASR — Gemma-4-E2B (ASR)

#29
by cstr - opened

We've added Gemma-4-E2B to CrispASR as the gemma4-e2b backend. C++ binary, GGUF — no Python, no transformers.

src/gemma4_e2b.cpp — USM Conformer audio encoder (12L, 1024d, chunked-local attention with relative position bias, ClippableLinear with QAT scalars, LightConv1d) + Gemma4 LLM decoder (35L, 1536d, GQA 8Q/1KV, per-layer embeddings, hybrid sliding/full attention with per-layer-type head_dim, GeGLU MLP).

Two non-obvious bits cost the most time during the port:

  1. Gemma4ClippableLinear.forward clamps every input AND output of every q/k/v/o/ffw_layer/lconv1d.linear with trained finite bounds (±5..±40). These are not training-time noise: skipping them collapsed the audio_layer_11 cosine similarity against HF to 0.51. Patching HF locally to disable the clamps reproduced our pre-fix cosine exactly, which gave us unambiguous attribution. The cosine jumped to 0.97 once we wired the clamps in. 480 scalars are persisted per audio tower in the GGUF.
  2. Audio FE is bit-different from Whisper-style: frame_length=320, not 400; window/hop/normalisation are all distinct. It is not interchangeable with our shared core/mel.h.
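
For anyone porting this elsewhere, here is a minimal sketch of what the clamping in point 1 amounts to. It is not the code in src/gemma4_e2b.cpp: the ClipBounds struct, the clipped_linear name, the row-major weight layout, the missing bias and the assumption of four scalars per layer (input min/max, output min/max) are all illustrative. It needs C++17 for std::clamp.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical holder for the trained clamp scalars of one linear layer.
// Assumption: one min/max pair for the input and one for the output.
struct ClipBounds {
    float in_min, in_max;
    float out_min, out_max;
};

// Computes y = clamp(W * clamp(x, in_min, in_max), out_min, out_max).
// W is row-major with shape [out_dim x in_dim]; no bias term (assumption).
std::vector<float> clipped_linear(const std::vector<float>& W,
                                  std::size_t out_dim, std::size_t in_dim,
                                  const std::vector<float>& x,
                                  const ClipBounds& b) {
    std::vector<float> y(out_dim, 0.0f);
    for (std::size_t o = 0; o < out_dim; ++o) {
        float acc = 0.0f;
        for (std::size_t i = 0; i < in_dim; ++i) {
            // Clamp each input element before it enters the matmul.
            const float xi = std::clamp(x[i], b.in_min, b.in_max);
            acc += W[o * in_dim + i] * xi;
        }
        // Clamp the accumulated output as well.
        y[o] = std::clamp(acc, b.out_min, b.out_max);
    }
    return y;
}
```

Dropping either std::clamp call turns this back into a plain linear layer, which is the configuration that gave us the 0.51 cosine.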

Q4_K transcribes JFK perfectly:

"And so my fellow Americans ask not what your country can do for you, ask what you can do for your country."

Pre-quantised GGUFs (Apache-2.0): cstr/gemma4-e2b-it-GGUF

git clone https://github.com/CrispStrobe/CrispASR && cd CrispASR
cmake -S . -B build && cmake --build build -j8
./build/bin/crispasr --backend gemma4-e2b -m gemma4-e2b-it-q4_k.gguf -f audio.wav -osrt
