HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading • Paper 2502.12574 • Published Feb 18, 2025 (see the sketch after this list)
SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference • Paper 2502.18137 • Published Feb 25, 2025
MInference 1.0: 10x Faster Million Context Inference with a Single GPU • Article • Published Jul 11, 2024
Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning • Paper 2505.24726 • Published May 30, 2025
PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs • Paper 2410.05265 • Published Oct 7, 2024
LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation • Paper 2503.19950 • Published Mar 25, 2025
Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads • Paper 2501.15113 • Published Jan 25, 2025
InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference • Paper 2409.04992 • Published Sep 8, 2024
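Several entries above attack the same bottleneck: at long context, the KV cache dominates GPU memory. As a purely illustrative sketch of the head-wise offloading idea named in the HeadInfer title (the class, the pinned-CPU buffers, and the fetch-one-head-at-a-time policy are assumptions for demonstration, not the paper's actual algorithm), the PyTorch snippet below keeps every head's K/V on CPU and brings only one head's cache onto the GPU at a time during decoding:

```python
# Illustrative sketch only: head-wise KV-cache offloading in the spirit of
# the HeadInfer title. Everything here is a demonstration assumption, not
# the paper's implementation.
import torch

class HeadwiseKVCache:
    """Stores each attention head's K/V tensors in pinned CPU memory and
    moves one head at a time to the GPU, so peak GPU residency scales with
    a single head's cache rather than all heads'."""

    def __init__(self, num_heads, max_seq, head_dim, device="cuda"):
        self.device = device
        # Pinned (page-locked) CPU memory speeds up host<->device copies
        # and allows them to be asynchronous via non_blocking=True.
        self.k = [torch.empty(max_seq, head_dim, pin_memory=True)
                  for _ in range(num_heads)]
        self.v = [torch.empty(max_seq, head_dim, pin_memory=True)
                  for _ in range(num_heads)]
        self.len = 0

    def append(self, k_new, v_new):
        # k_new, v_new: (num_heads, t, head_dim), produced on the GPU.
        # Copy each head's new K/V slice out to its CPU buffer.
        t = k_new.shape[1]
        for h in range(len(self.k)):
            self.k[h][self.len:self.len + t].copy_(k_new[h], non_blocking=True)
            self.v[h][self.len:self.len + t].copy_(v_new[h], non_blocking=True)
        self.len += t

    def attend(self, q):
        # q: (num_heads, 1, head_dim) decode-step queries on the GPU.
        outs = []
        for h in range(q.shape[0]):
            # Fetch only head h's cache onto the GPU, attend, then let it
            # be freed before the next head is fetched.
            k = self.k[h][: self.len].to(self.device, non_blocking=True)
            v = self.v[h][: self.len].to(self.device, non_blocking=True)
            scores = (q[h] @ k.T) / k.shape[-1] ** 0.5
            outs.append(torch.softmax(scores, dim=-1) @ v)
        return torch.stack(outs)  # (num_heads, 1, head_dim)
```

A real system would overlap these host-to-device copies with attention compute on separate CUDA streams and size the GPU-resident set by available memory; the serial loop above only shows why peak GPU residency drops from all heads' KV to a single head's.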