Safetensors
qwen3
custom_code
Files changed (1)
  1. README.md +104 -3
README.md CHANGED
@@ -1,3 +1,104 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ ---
4
+ # 🚀 ROCKET: Rapid Optimization via Calibration-guided Knapsack Enhanced Truncation for Efficient Model Compression
5
+ [![arXiv](https://img.shields.io/badge/arXiv-2602.11008-b31b1b.svg)](https://arxiv.org/abs/2602.11008)
6
+ [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
7
+ [![Python 3.9+](https://img.shields.io/badge/Python-3.9%2B-blue.svg)](https://www.python.org/downloads/)
8
+ [![Kaggle](https://img.shields.io/badge/Kaggle-Notebook-20beff?logo=kaggle&logoColor=white)]()
9
+ ![ROCKET Architecture](figs/lunch.JPG)
10
+ In a quiet corner of the AI research lab, a cartoon rocket stood on the launchpad: bright red, cheerful, and boldly labeled “LLM.” At the control console sat a scientist, fingers hovering over a single, enormous red button marked “Solve MCKP.”
11
+ With a deep breath and a flicker of hope, they pressed it.
12
+ The rocket roared to life. Flames erupted, scattering clouds of sparse matrices like confetti made of zeros. As the LLM blasted into the stratosphere of efficient inference, it left behind on the pad a humble knapsack overflowing not with gold, but with perfectly balanced (rank, sparsity) pairs: the optimal solutions to the Multiple-Choice Knapsack Problem, handpicked for model compression.
13
+ Up it soared: lighter, faster, smarter, carrying only what truly mattered.
14
+
15
+ ## Model Description
16
+ **ROCKET** (Rapid Optimization via Calibration-guided Knapsack Enhanced Truncation) is a novel **training-free** model compression method that achieves state-of-the-art performance by combining two key innovations:
17
+
18
+ - **Multiple-Choice Knapsack Budget Allocation**: Formulates layer-wise compression as an optimization problem, selecting the optimal (rank, sparsity) configuration per layer to minimize total reconstruction error under a global parameter budget.
19
+ - **Single-Step Sparse Factorization**: Uses calibration-guided structured sparsification with closed-form dictionary updates via least squares, **bypassing iterative optimization, sparse coding, and backpropagation entirely**.
20
+
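The budget allocation described above can be illustrated with a small dynamic program. This is a minimal sketch under stated assumptions, not the ROCKET implementation: each layer is assumed to offer a short list of candidate configurations, each reduced to a hypothetical `(cost, error)` pair, where `cost` is an integer number of budget units (for example, parameters in thousands) and `error` is the reconstruction error measured on calibration data. The function name `solve_mckp` is invented for illustration.

```python
from math import inf


def solve_mckp(layers, budget):
    """Pick exactly one (cost, error) option per layer, minimizing the
    summed error while keeping the summed cost <= budget.

    layers: list of lists of (cost, error); costs are non-negative ints.
    Returns (total_error, chosen_option_index_per_layer).
    """
    # best[b] = minimal error over the layers seen so far at total cost b
    best = [inf] * (budget + 1)
    best[0] = 0.0
    history = []  # per layer: for each cost b, (option index, previous cost)

    for options in layers:
        new_best = [inf] * (budget + 1)
        pick = [None] * (budget + 1)
        for b in range(budget + 1):
            if best[b] == inf:
                continue  # cost b is unreachable with the previous layers
            for j, (cost, err) in enumerate(options):
                nb = b + cost
                if nb <= budget and best[b] + err < new_best[nb]:
                    new_best[nb] = best[b] + err
                    pick[nb] = (j, b)
        best = new_best
        history.append(pick)

    end = min(range(budget + 1), key=lambda b: best[b])
    if best[end] == inf:
        raise ValueError("budget too small for any feasible assignment")
    total = best[end]

    # Walk the recorded choices backwards to recover one option per layer.
    chosen = []
    b = end
    for pick in reversed(history):
        j, b = pick[b]
        chosen.append(j)
    chosen.reverse()
    return total, chosen


if __name__ == "__main__":
    # Two toy layers, three hypothetical (cost, error) candidates each.
    toy_layers = [
        [(4, 0.1), (2, 0.5), (1, 0.9)],
        [(4, 0.2), (2, 0.3), (1, 1.0)],
    ]
    print(solve_mckp(toy_layers, budget=6))
```

The table has one entry per budget unit, so run time grows with `budget * total number of options`; coarser budget units trade allocation precision for speed.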
21
+ The approach operates in whitened activation space, applies importance-weighted sparsification, and recovers compressed weights as a product of two factors compatible with standard dense linear algebra.
22
+
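To make the whitened, single-step factorization more concrete, here is a rough NumPy sketch. It is an illustration under assumptions, not the authors' code: the function name `whitened_sparse_factorize`, the use of the activation Gram matrix square root as the whitening transform, and plain per-row top-k magnitude selection as a stand-in for the importance weighting are all choices made for this example, since the README does not specify them.

```python
import numpy as np


def whitened_sparse_factorize(W, X, rank, keep):
    """Approximate W (out x in) as C @ B using calibration activations X (in x n).

    C (out x rank) has at most `keep` nonzeros per row (1 <= keep <= rank);
    B (rank x in) is a dense basis. Illustrative sketch only.
    """
    # Whitening transform from the activation Gram matrix (symmetric PSD).
    gram = X @ X.T / X.shape[1]
    evals, evecs = np.linalg.eigh(gram)
    evals = np.clip(evals, 1e-8, None)  # guard against rank deficiency
    S = (evecs * np.sqrt(evals)) @ evecs.T      # gram^(1/2)
    S_inv = (evecs / np.sqrt(evals)) @ evecs.T  # gram^(-1/2)

    Wt = W @ S  # weights expressed in whitened activation space

    # Rank-`rank` basis from an eigen decomposition (eigh sorts ascending).
    _, V = np.linalg.eigh(Wt.T @ Wt)
    basis = V[:, -rank:].T   # (rank, in)
    coeffs = Wt @ basis.T    # (out, rank)

    # Each output row keeps its own `keep` largest-magnitude coefficients.
    drop = np.argsort(np.abs(coeffs), axis=1)[:, :-keep]
    np.put_along_axis(coeffs, drop, 0.0, axis=1)

    # Closed-form least-squares basis update for the fixed sparse coefficients:
    # argmin_B ||Wt - coeffs @ B||_F
    basis, *_ = np.linalg.lstsq(coeffs, Wt, rcond=None)

    # Map the basis back from whitened space so that W ~= coeffs @ B.
    return coeffs, basis @ S_inv


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((6, 8))
    X = rng.standard_normal((8, 64))
    C, B = whitened_sparse_factorize(W, X, rank=4, keep=2)
    rel_err = np.linalg.norm((W - C @ B) @ X) / np.linalg.norm(W @ X)
    print(C.shape, B.shape, rel_err)
```

Because every row of `C` selects its own subset of basis vectors, the product `C @ B` is not limited to a single shared low-rank subspace, which is the union-of-subspaces behavior the description refers to.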
23
+ ## Key Features
24
+ - 🚀 **Training-Free Compression**: No fine-tuning required; compresses LLMs in minutes using only a small calibration set (~256 samples)
25
+ - 🎯 **Optimal Budget Allocation**: Dynamic programming solves layer-wise compression allocation to preserve performance where it matters most
26
+ - ⚡ **Single-Step Factorization**: Replaces expensive K-SVD/OMP with eigen decomposition + closed-form least squares, **96× faster** than baselines
27
+ - ๐Ÿ” **Union-of-Subspaces Flexibility**: Each output dimension activates a distinct subset of basis vectors, overcoming rigid low-rank constraints
28
+ - 🔌 **Hardware-Compatible**: Produces structured sparse factorizations that merge seamlessly during inference
29
+
30
+ ## 🔥 Performance: Qwen3-14B → 8B Compression vs. Native Qwen3-8B
31
+
32
+ | Method | Compression | State | PIQA 🧠 | HellaSwag 🔄 | LAMBADA 🦙 | ARC-e 🔬 | ARC-c 🧩 | SciQ 📚 | Race 🏁 | MMLU 🎓 | Avg. Acc 📊 | WikiText PPL ↓ |
33
+ |--------|-------------|-------|---------|--------------|------------|----------|----------|---------|---------|---------|-------------|----------------|
34
+ | **Qwen3-14B** (dense) | – | baseline | 79.86 | 78.85 | 67.88 | 82.82 | 60.23 | 96.50 | 43.25 | 77.20 | **73.32** | 1.1E+01 |
35
+ | **Qwen3-8B** (dense) | – | baseline | 77.70 | 74.90 | 64.10 | 80.70 | 56.70 | 95.70 | 40.90 | 73.00 | **70.46** | 1.2E+01 |
36
+ | **ROCKET-Qwen3-8B** | 40% (14B→8B) | training-free | 72.68 | 62.63 | 70.26 | 67.76 | 44.19 | 91.20 | 39.80 | 59.99 | 63.56 | 2.5E+01 |
37
+ | **ROCKET-Qwen3-8B** (healed) ✨ | 40% + 30M tokens | light fine-tune | **78.51** 🏆 | **74.67** 🏆 | 65.55 | **75.29** 🏆 | **53.07** 🏆 | **93.50** 🏆 | 37.89 | **65.23** 🏆 | **67.96** 🏆 | 1.3E+01 🏆 |
38
+
39
+ ### Key Takeaways:
40
+ - ✅ **Training-free ROCKET** retains ~90% of the native 8B model's average accuracy (63.56 vs 70.46) with **zero fine-tuning**
41
+ - ✨ **With minimal healing** (30M tokens, fixed sparsity), ROCKET reaches **96.5% of native 8B performance** (67.96 vs 70.46), nearly matching a model trained from scratch
42
+ - 📉 Perplexity after healing (1.3E+01) is close to that of the native 8B model (1.2E+01)
43
+ - 💡 This demonstrates a practical alternative to multi-size training: **train one large model, compress to any target size with ROCKET**
44
+
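The retention figures in the takeaways follow directly from the table. The short check below recomputes them from the per-task scores; the dictionary keys and variable names are only labels chosen for this example.

```python
# Per-task accuracies copied from the table above, in column order:
# PIQA, HellaSwag, LAMBADA, ARC-e, ARC-c, SciQ, Race, MMLU
scores = {
    "Qwen3-8B (dense)": [77.70, 74.90, 64.10, 80.70, 56.70, 95.70, 40.90, 73.00],
    "ROCKET-Qwen3-8B": [72.68, 62.63, 70.26, 67.76, 44.19, 91.20, 39.80, 59.99],
    "ROCKET-Qwen3-8B (healed)": [78.51, 74.67, 65.55, 75.29, 53.07, 93.50, 37.89, 65.23],
}

# Unweighted mean over the eight tasks, as in the "Avg. Acc" column.
averages = {name: sum(vals) / len(vals) for name, vals in scores.items()}

# Retention = compressed average as a percentage of the native 8B average.
native = averages["Qwen3-8B (dense)"]
retention = {name: 100.0 * avg / native for name, avg in averages.items()}

for name in scores:
    print(f"{name}: avg={averages[name]:.2f}, retention={retention[name]:.1f}%")
```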
45
+ > ๐ŸŒ **Environmental Impact**: ROCKET consumes **100ร— less energy** and produces **23ร— lower COโ‚‚ emissions** than iterative dictionary learning baselines.
46
+
47
+ ![ROCKET](figs/logo.png)
48
+ ## Installation
49
+ We highly recommend using this Docker image to ensure reproducibility:
50
+ ```
51
+ pytorch/pytorch:2.7.1-cuda12.6-cudnn9-devel
52
+ ```
53
+ Then run
54
+ ```bash
55
+ pip install -e .
56
+ ```
57
+ ## Running
58
+ We provide several console entry points. To run the full pipeline:
59
+ ```bash
60
+ rocket-run-pipeline --config "./rocket/config/default.yaml"
61
+ ```
62
+ You can use the sample <a href="./rocket/config/default.yaml">config</a> file and modify it according to your requirements.
63
+ The other entry points are:
64
+ ```bash
65
+ rocket-profile-layers --config CONFIG      # Profiling only
66
+ rocket-compress --config CONFIG            # Compression only
67
+ rocket-evaluate --config CONFIG            # Evaluation only
68
+ rocket-gather-activations --config CONFIG  # Prepare calibration data
69
+ ```
70
+
71
+ ## Optimized Inference
72
+ Note that the extra folder contains a modeling file for running the optimized version, which includes an implementation of MACKO and fused layers.
73
+ To use the optimized version after compression finishes, load the model from the modeling file and call `optimize`:
74
+ ```python
75
+ import torch  # required for torch.compile below
+ from transformers import AutoTokenizer
76
+ from modeling_llama_svdllm_opt import LlamaForCausalLM
77
+ model = LlamaForCausalLM.from_pretrained("MODEL_PATH", device_map="cuda", torch_dtype="float16", compression_path="./cr_llama.json")
78
+ tokenizer = AutoTokenizer.from_pretrained("MODEL_PATH")
79
+ model.optimize()
80
+ model = torch.compile(model, mode="reduce-overhead", fullgraph=True)
81
+ ```
82
+
83
+ ## Citation
84
+ If you use ROCKET in your research, please cite our paper:
85
+
86
+ ```bibtex
87
+ @article{ali2026rocket0,
88
+ title = {ROCKET: Rapid Optimization via Calibration-guided Knapsack Enhanced Truncation for Efficient Model Compression},
89
+ author = {Ammar Ali and Baher Mohammad and Denis Makhov and Dmitriy Shopkhoev and Magauiya Zhussip and Stamatios Lefkimmiatis},
90
+ year = {2026},
91
+ journal = {arXiv preprint arXiv:2602.11008}
92
+ }
93
+
94
+ ```
95
+
96
+ Credit for the inference optimization goes to:
97
+ ```bibtex
98
+ @article{macko2025macko0,
99
+ title = {MACKO: Sparse Matrix-Vector Multiplication for Low Sparsity},
100
+ author = {Vladimír Macko and Vladimír Boža},
101
+ year = {2025},
102
+ journal = {arXiv preprint arXiv:2511.13061}
103
+ }
104
+ ```