KVHALO: Key-Value Cache Manifold Super-Resolution Reconstructor

KVHALO (Key-Value High-Fidelity Attenuation Latent Optimization) is a compact 92-million-parameter auxiliary model designed to ease the linear memory scaling of the KV cache during LLM inference. Built around a specialized Hierarchical Reasoning Model (HRM) core, KVHALO takes heavily degraded 1-bit or 2-bit quantized KV cache representations and reconstructs the high-fidelity, continuous 16-bit floating-point tensor manifold on the fly.

During autoregressive token generation, the reconstructed vectors are expanded only temporarily, so that the main model's attention heads receive inputs in the form they expect, and are then immediately evicted from VRAM. This preserves deep context-steering stability while keeping the memory footprint small.
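The flow described above (store a low-bit cache, expand it only when attention needs it, then discard the expanded copy) can be illustrated with a minimal sketch. This is not KVHALO's reconstructor: the learned HRM step is replaced here by naive uniform dequantization, and every name in the block (`quantize_2bit`, `dequantize`, `cosine_similarity`, `attend_transiently`) is hypothetical. It only shows what a 2-bit round trip does to a vector and how the fidelity of the result can be measured with cosine similarity.

```python
import math
import random


def quantize_2bit(vec):
    """Map each value to one of 4 evenly spaced levels (2 bits per value)."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 3 or 1.0  # 3 steps between 4 levels; 1.0 if vec is constant
    codes = [min(3, max(0, round((x - lo) / scale))) for x in vec]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Naive expansion back to floats (stand-in for a learned reconstructor)."""
    return [lo + c * scale for c in codes]


def cosine_similarity(a, b):
    """Directional agreement between two vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def attend_transiently(query, codes, lo, scale):
    """Expand a cached key only for the duration of one dot product."""
    key = dequantize(codes, lo, scale)  # temporary full-precision copy
    score = sum(q * k for q, k in zip(query, key))
    return score  # `key` goes out of scope here; only the codes stay cached


if __name__ == "__main__":
    random.seed(0)
    original = [random.gauss(0.0, 1.0) for _ in range(128)]
    codes, lo, scale = quantize_2bit(original)
    restored = dequantize(codes, lo, scale)
    print("cosine similarity after naive 2-bit round trip: "
          f"{cosine_similarity(original, restored):.4f}")
```

In this sketch only `codes`, `lo`, and `scale` persist between generation steps; the full-precision key exists solely inside `attend_transiently`. A learned reconstructor would take the place of `dequantize`.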

Quick System Telemetry & Metrics

Reconstruction Fidelity ($D_{cos}$): Peak directional cosine alignment of 91.85% at convergence (loss floor: 0.1630).
VRAM Efficiency: Reduces the Layer 15 attention cache footprint from 5.92 MB to 633.75 KB in local runs.
Enterprise Scaling Factor: Projected savings of ~6.06 GB of VRAM for a batch of 32 concurrent users across a full 32-layer model.
Throughput Parity: No measurable latency overhead. Generation runs at the same speed with KVHALO enabled (~6.31 tokens/sec on Apple Silicon).
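The memory figures above can be cross-checked with a short back-of-the-envelope calculation. The sketch below makes two assumptions that the source does not state: every layer has the same footprint as Layer 15, and 1 MB = 1024 KB. The helper `cache_footprint_mb` is hypothetical and exists only for this illustration.

```python
KB_PER_MB = 1024  # assumption: binary units

LAYERS = 32  # full model profile, from the metrics above
USERS = 32   # concurrent batch size, from the metrics above


def cache_footprint_mb(per_layer_mb, layers=LAYERS, users=USERS):
    """Total cache size in MB, assuming every layer matches the given one."""
    return per_layer_mb * layers * users


baseline_layer_mb = 5.92                  # Layer 15, uncompressed
compressed_layer_mb = 633.75 / KB_PER_MB  # Layer 15, with KVHALO

compression_ratio = baseline_layer_mb / compressed_layer_mb
baseline_total_mb = cache_footprint_mb(baseline_layer_mb)
compressed_total_mb = cache_footprint_mb(compressed_layer_mb)
saved_mb = baseline_total_mb - compressed_total_mb

if __name__ == "__main__":
    print(f"per-layer compression ratio: {compression_ratio:.2f}x")
    print(f"baseline total:   {baseline_total_mb:.2f} MB")
    print(f"compressed total: {compressed_total_mb:.2f} MB")
    print(f"saved:            {saved_mb:.2f} MB")
```

Under these assumptions the uncompressed total comes to about 6062 MB and the compressed total to about 634 MB, a per-layer ratio of roughly 9.6x. Computed this way, the ~6.06 GB figure lines up with the uncompressed total, and the net saving would be closer to 5.4 GB; the source may be using a different accounting (for example, different units or per-layer variation).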

Model Initialization / Modelfile Usage

For model usage, see the GitHub repository.

Citation & License

This project is licensed under Apache 2.0.

