z-lab/DeepSeek-R1-Distill-Llama-8B-PARO
Pairwise Rotation Quantization for Efficient Reasoning LLM Inference
ParoQuant is the state-of-the-art INT4 quantization for LLMs. It closes the accuracy gap with FP16 while running at near-AWQ speed. Supports NVIDIA GPUs (vLLM, Transformers) and Apple Silicon (MLX). For more information, see https://github.com/z-lab/paroquant.
z-lab/DeepSeek-R1-Distill-Llama-8B-PARO is a 4-bit deepseek-ai/DeepSeek-R1-Distill-Llama-8B quantized with ParoQuant. Check out other ParoQuant models from the Hugging Face collection.
Quick Start
Installation
# NVIDIA GPU (CUDA 12.9)
pip install "paroquant[vllm]"
# NVIDIA GPU (CUDA 13.0)
pip install "paroquant[vllm]" "vllm==0.17.1" \
--extra-index-url https://wheels.vllm.ai/0.17.1/cu130 \
--extra-index-url https://download.pytorch.org/whl/cu130
# Apple Silicon
pip install "paroquant[mlx]"
Interactive Chat
python -m paroquant.cli.chat --model z-lab/DeepSeek-R1-Distill-Llama-8B-PARO
OpenAI-Compatible API Server
python -m paroquant.cli.serve --model z-lab/DeepSeek-R1-Distill-Llama-8B-PARO --port 8000
Add --llm-only if you do not wish to load the VLM components.
Docker (NVIDIA GPU)
The following commands map the local cache directory to the container in order to persist kernel cache across runs. Remove
-v ...to disable this behaviour.
# Interactive chat
docker run --pull=always --rm -it --gpus all --ipc=host \
-v $HOME/.cache/paroquant:/root/.cache/paroquant \
ghcr.io/z-lab/paroquant:chat --model z-lab/DeepSeek-R1-Distill-Llama-8B-PARO
# API server (port 8000)
docker run --pull=always --rm -it --gpus all --ipc=host -p 8000:8000 \
-v $HOME/.cache/paroquant:/root/.cache/paroquant \
ghcr.io/z-lab/paroquant:serve --model z-lab/DeepSeek-R1-Distill-Llama-8B-PARO
Citation
@inproceedings{liang2026paroquant,
title = {{ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference}},
author = {Liang, Yesheng and Chen, Haisheng and Zhang, Zihan and Han, Song and Liu, Zhijian},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2026}
}
- Downloads last month
- 102
Model tree for z-lab/DeepSeek-R1-Distill-Llama-8B-PARO
Base model
deepseek-ai/DeepSeek-R1-Distill-Llama-8BCollection including z-lab/DeepSeek-R1-Distill-Llama-8B-PARO
Collection
Pairwise Rotation Quantization for Efficient Reasoning LLM Inference • 16 items • Updated
• 12