KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
Paper: arXiv:2401.18079
KVQuant is a methodology for efficient KV cache quantization that incorporates several innovations to achieve accurate low-precision quantization, thereby enabling efficient long context length inference.
TLDR: KVQuant addresses the memory bottleneck of long context length inference by quantizing the KV cache to low precision. KVQuant achieves high accuracy with low-precision KV cache quantization by considering several consistent patterns observed in cached KV values across different LLMs, and by developing methods to exploit these patterns, including:

- Per-Channel Key Quantization: quantizing Keys along the channel dimension to better match the outlier channels observed in Key activations
- Pre-RoPE Key Quantization: quantizing Keys before the rotary positional embedding is applied, which mitigates its impact on quantization
- Non-Uniform KV Cache Quantization: sensitivity-weighted non-uniform datatypes that better represent the distribution of cached KV values
- Per-Vector Dense-and-Sparse Quantization: isolating outliers separately for each vector to narrow the range that must be quantized
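As a rough illustration of the first point, the sketch below contrasts per-channel quantization of Keys with per-token quantization of Values using a plain uniform quantizer. It is a minimal, simplified example and not the KVQuant implementation (which additionally uses pre-RoPE quantization, non-uniform datatypes, and dense-and-sparse decomposition); all tensor shapes and helper names are assumptions for illustration.

```python
import torch

def quantize_symmetric(x: torch.Tensor, dim: int, n_bits: int = 4):
    """Uniformly quantize `x` with a separate scale along dimension `dim` (illustrative only)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(scale.dtype) * scale

# Hypothetical cached Keys/Values for one head, shape (seq_len, head_dim).
keys = torch.randn(1024, 128)
values = torch.randn(1024, 128)

# Keys: outliers concentrate in specific channels, so use one scale per channel
# (reduce over the token dimension). Values: use one scale per token instead.
k_q, k_scale = quantize_symmetric(keys, dim=0, n_bits=4)
v_q, v_scale = quantize_symmetric(values, dim=1, n_bits=4)

# Mean reconstruction error of the per-channel Key quantizer.
print((dequantize(k_q, k_scale) - keys).abs().mean())
```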
For more details, please check out our paper.
This repository contains the quantizer file for running DBRX with a 4-bit KV cache using KVQuant.
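The snippet below is a hypothetical sketch of inspecting such a serialized quantizer file; the file name (`quantizers.pickle`) and its contents (a mapping from layer/projection names to calibrated quantizer parameters) are assumptions, so consult the KVQuant repository for the exact loading and inference code.

```python
import pickle

# Assumed file name; replace with the actual quantizer file from this repository.
with open("quantizers.pickle", "rb") as f:
    quantizers = pickle.load(f)

# Assuming a dict-like structure, list a few entries to see which layers are covered.
for name in list(quantizers)[:5]:
    print(name)
```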