KVHALO: Key-Value Cache Manifold Super-Resolution Reconstructor
KVHALO (Key-Value High-Fidelity Attenuation Latent Optimization) is a compact 92-million-parameter auxiliary model designed to relieve the memory bottleneck of LLM inference, where the KV cache grows linearly with context length. Built around a specialized Hierarchical Reasoning Model (HRM) core, KVHALO takes heavily degraded 1-bit or 2-bit quantized KV cache representations and reconstructs the high-fidelity 16-bit floating-point tensors on the fly.
During autoregressive token generation, the reconstructed vectors are expanded only temporarily, so that the main model's attention heads receive tensors in the 16-bit form they expect, and are then immediately evicted from VRAM. This safeguards deep context-steering stability while keeping the memory footprint small.
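The store-compressed, expand-temporarily, evict pattern described above can be sketched in a few lines. This is an illustration under stated assumptions, not KVHALO's implementation: plain Python lists stand in for tensors, the min/max 2-bit quantizer is a generic scheme chosen for the example, and `dequantize_2bit` is a naive stand-in for the learned HRM reconstructor. All function names here are hypothetical.

```python
from typing import List, Tuple

LEVELS = 4  # 2 bits per value -> 4 quantization levels


def quantize_2bit(vec: List[float]) -> Tuple[bytes, float, float]:
    """Quantize one vector to packed 2-bit codes plus (offset, scale)."""
    if len(vec) % 4 != 0:
        raise ValueError("length must be a multiple of 4 (4 codes per byte)")
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / (LEVELS - 1) or 1.0  # fall back to 1.0 for constant vectors
    codes = [min(LEVELS - 1, max(0, round((v - lo) / scale))) for v in vec]
    packed = bytearray()
    for i in range(0, len(codes), 4):
        a, b, c, d = codes[i:i + 4]
        packed.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(packed), lo, scale


def dequantize_2bit(packed: bytes, lo: float, scale: float) -> List[float]:
    """Naive reconstruction (assumption: stands in for the learned model)."""
    out = []
    for byte in packed:
        for shift in (0, 2, 4, 6):
            out.append(lo + ((byte >> shift) & 0b11) * scale)
    return out


def attention_scores(query: List[float], cache) -> List[float]:
    """Expand each cached key only for the dot product, then discard it."""
    scores = []
    for packed, lo, scale in cache:
        key = dequantize_2bit(packed, lo, scale)  # temporary full-precision copy
        scores.append(sum(q * k for q, k in zip(query, key)))
        # `key` is rebound on the next iteration; only the packed bytes persist
    return scores


# Hypothetical toy data: two cached key vectors of dimension 8.
keys = [
    [0.5, -1.25, 2.0, 0.0, 1.0, -0.75, 0.25, -2.0],
    [1.5, 0.1, -0.3, 0.9, -1.1, 0.4, 0.0, 2.2],
]
cache = [quantize_2bit(k) for k in keys]
query = [1.0] * 8
scores = attention_scores(query, cache)
```

Each 8-value key is held as 2 packed bytes plus an offset and a scale, and the full-precision copy exists only inside the loop body. A learned reconstructor would replace `dequantize_2bit` while leaving the surrounding pattern unchanged.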
Quick System Telemetry & Metrics
- Reconstruction Fidelity ($D_{cos}$): Peak absolute directional cosine alignment of 91.85%, reached at deep convergence (floor loss: 0.1630).
- VRAM Efficiency: Reduces the Layer 15 attention cache footprint from 5.92 MB to 633.75 KB in local runs.
- Enterprise Scaling Factor: Projected savings of ~6.06 GB of physical VRAM for a batch of 32 concurrent users across a full 32-layer model.
- Throughput Parity: No measurable latency overhead; generation runs at the same local speed (~6.31 tokens/sec on Apple Silicon).
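The fidelity and memory figures above can be reproduced with simple arithmetic. The sketch below is an assumption-laden illustration: it takes $D_{cos}$ to be the ordinary cosine similarity between an original and a reconstructed vector, treats 1 MB as 1024 KB, and assumes every layer and every user has the same footprint as Layer 15. The function names are hypothetical and not part of KVHALO.

```python
import math
from typing import Sequence

KB_PER_MB = 1024  # assumption: binary units


def cosine_alignment(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between an original and a reconstructed vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine alignment is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def compression_ratio(fp16_mb: float, quantized_kb: float) -> float:
    """How many times smaller the quantized cache is than the FP16 cache."""
    return fp16_mb * KB_PER_MB / quantized_kb


def projected_savings_mb(fp16_mb: float, quantized_kb: float,
                         layers: int, users: int) -> float:
    """Savings if every layer and user matched the measured layer (assumption)."""
    per_layer_mb = fp16_mb - quantized_kb / KB_PER_MB
    return per_layer_mb * layers * users


# Figures quoted in the metrics list above.
ratio = compression_ratio(5.92, 633.75)
savings_mb = projected_savings_mb(5.92, 633.75, layers=32, users=32)
savings_gb = savings_mb / KB_PER_MB

# For unit-length vectors, squared Euclidean distance equals 2 * (1 - cos).
implied_distance = 2 * (1 - 0.9185)
```

With the Layer 15 numbers, the ratio comes out at roughly 9.6x and the projected savings at about 5.3 GB under these simple assumptions, so the ~6.06 GB projection presumably rests on details the list does not spell out. Separately, `implied_distance` evaluates to 0.1630, which matches the reported floor loss; this suggests, though the source does not confirm it, that the training loss is a squared distance between normalized vectors.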
Model Initialization / Modelfile Usage
For usage instructions, see the GitHub repository.
Citation & License
This project is licensed under the Apache License 2.0.