Author: Simon-Pierre Boucher
# DeepSeek-R1-Distill-Qwen-7B - GGUF Quantized by worthdoing

Quantized for local Mac inference (Apple Silicon / Metal) by worthdoing

## About
This is a GGUF quantized version of DeepSeek-R1-Distill-Qwen-7B, optimized for running locally on Apple Silicon Macs with llama.cpp, Ollama, or LM Studio.
- Original model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
- Parameters: 7B
- Quantized by: worthdoing
- Pipeline: corelm-model v1.0
## Description

DeepSeek-R1's reasoning ability distilled into Qwen 7B: a compact chain-of-thought model aimed at math, coding, and general reasoning tasks.
## Available Quantizations

| File | Quant | BPW | Size | Use Case |
|---|---|---|---|---|
| deepseek-r1-distill-qwen-7b-Q4_K_M-worthdoing.gguf | Q4_K_M | 4.58 | ~3.7 GB | Recommended - best quality/size ratio |
| deepseek-r1-distill-qwen-7b-Q5_K_M-worthdoing.gguf | Q5_K_M | 5.33 | ~4.3 GB | Higher quality, still fast |
| deepseek-r1-distill-qwen-7b-Q8_0-worthdoing.gguf | Q8_0 | 7.96 | ~6.5 GB | Near-original quality |
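To fetch a single file from the Hub, the `huggingface-cli` tool works well (a sketch; assumes `pip install -U huggingface_hub` and the repo id `worthdoing/DeepSeek-R1-Distill-Qwen-7B-GGUF`):

```bash
# Download only the recommended Q4_K_M file into the current directory
huggingface-cli download worthdoing/DeepSeek-R1-Distill-Qwen-7B-GGUF \
  deepseek-r1-distill-qwen-7b-Q4_K_M-worthdoing.gguf \
  --local-dir .
```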
## How to Use

### With Ollama

```bash
# Create a Modelfile
cat > Modelfile <<'MODELEOF'
FROM ./deepseek-r1-distill-qwen-7b-Q4_K_M-worthdoing.gguf
MODELEOF

ollama create deepseek-r1-distill-qwen-7b -f Modelfile
ollama run deepseek-r1-distill-qwen-7b
```
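The Modelfile above uses llama.cpp defaults. If you want to bake sampling parameters into the model, a sketch like the following should work (the 0.6 temperature follows DeepSeek's published recommendation of 0.5-0.7 for the R1 distills; `num_ctx` is raised because chain-of-thought outputs run long):

```bash
cat > Modelfile <<'MODELEOF'
FROM ./deepseek-r1-distill-qwen-7b-Q4_K_M-worthdoing.gguf
# Moderate temperature for R1-style reasoning (assumption: 0.6, per DeepSeek's guidance)
PARAMETER temperature 0.6
# Larger context window so long reasoning traces are not truncated
PARAMETER num_ctx 8192
MODELEOF
```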
### With llama.cpp

```bash
llama-cli -m deepseek-r1-distill-qwen-7b-Q4_K_M-worthdoing.gguf -p "Your prompt here" -ngl 99
```
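For an interactive session, or to expose an OpenAI-compatible API for other apps, something like this should work (a sketch; flag names follow recent llama.cpp builds):

```bash
# Interactive chat, all layers offloaded to the Metal GPU (-ngl 99), 8K context
llama-cli -m deepseek-r1-distill-qwen-7b-Q4_K_M-worthdoing.gguf \
  -ngl 99 -c 8192 --temp 0.6 -cnv

# OpenAI-compatible server on localhost:8080
llama-server -m deepseek-r1-distill-qwen-7b-Q4_K_M-worthdoing.gguf \
  -ngl 99 -c 8192 --port 8080
```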
### With LM Studio

- Download the GGUF file
- Open LM Studio -> My Models -> Import
- Select the GGUF file and start chatting
## Quantization Method

Our quantization pipeline (corelm-model v1.0) follows a multi-step process designed to preserve model quality and ensure broad compatibility:
### Step 1 — Download & Validation

- Model weights are downloaded from HuggingFace Hub in SafeTensors format (`.safetensors`); a sketch of the filtered download follows this list
- Legacy formats (`.bin`, `.pt`) are excluded to ensure clean, verified weights
- Tokenizer, configuration, and all metadata are preserved
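As an illustration, the filtered download can be reproduced with `huggingface-cli` include patterns (a sketch; the exact pattern list is an assumption about which files the pipeline keeps):

```bash
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --include "*.safetensors" "*.json" "tokenizer*" \
  --local-dir ./DeepSeek-R1-Distill-Qwen-7B
```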
### Step 2 — Conversion to GGUF F16 Baseline

- The original model is converted to GGUF format at FP16 precision using `convert_hf_to_gguf.py` from llama.cpp (see the sketch below)
- This lossless baseline preserves the full original model quality
- Architecture-specific tensors (attention, FFN, embeddings, MoE routing) are mapped to their GGUF equivalents
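The conversion step looks roughly like this (a sketch; assumes a local llama.cpp checkout, with `--outtype f16` producing the FP16 baseline):

```bash
python llama.cpp/convert_hf_to_gguf.py ./DeepSeek-R1-Distill-Qwen-7B \
  --outfile deepseek-r1-distill-qwen-7b-f16.gguf \
  --outtype f16
```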
### Step 3 — K-Quant Quantization

- The F16 baseline is quantized using `llama-quantize` with k-quant methods (see the command sketch after the table below)
- K-quants use a mixed-precision approach: more important layers (attention, output) retain higher precision, while less sensitive layers (FFN) are compressed more aggressively
- Each quantization level offers a different quality/size tradeoff:
| Method | Bits per Weight | Strategy |
|---|---|---|
| Q4_K_M | ~4.58 bpw | Mixed 4/5-bit. Attention & output layers use Q5_K, FFN layers use Q4_K. Best balance of quality and size. |
| Q5_K_M | ~5.33 bpw | Mixed 5/6-bit. Attention & output layers use Q6_K, FFN layers use Q5_K. Higher quality with moderate size increase. |
| Q8_0 | ~7.96 bpw | Uniform 8-bit. All layers quantized to 8-bit. Near-lossless quality, largest file size. |
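Each target level is then a single `llama-quantize` invocation over the shared F16 baseline (a sketch; the positional input/output/type arguments follow current llama.cpp releases):

```bash
# Produce each target quantization from the same F16 baseline
for QUANT in Q4_K_M Q5_K_M Q8_0; do
  llama-quantize deepseek-r1-distill-qwen-7b-f16.gguf \
    deepseek-r1-distill-qwen-7b-${QUANT}-worthdoing.gguf \
    ${QUANT}
done
```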
### Step 4 — Metadata Injection

- Custom metadata is embedded directly in each GGUF file:
  - `general.quantized_by`: worthdoing
  - `general.quantization_version`: corelm-1.0
- This ensures full traceability and provenance of every quantized file
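You can verify the embedded metadata yourself with the dump script shipped in the `gguf` Python package (a sketch; assumes `pip install gguf`, which provides the `gguf-dump` command):

```bash
pip install gguf
gguf-dump deepseek-r1-distill-qwen-7b-Q4_K_M-worthdoing.gguf | grep -i quantized_by
```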
## Tools & Environment

- llama.cpp: Used for both conversion and quantization — the industry-standard open-source LLM inference engine
- Target platform: Apple Silicon Macs (M1/M2/M3/M4) with Metal GPU acceleration
- Inference runtimes: Compatible with `llama.cpp`, `Ollama`, `LM Studio`, `koboldcpp`, and any GGUF-compatible runtime
## Recommended Hardware
| Quant | Min RAM | Recommended |
|---|---|---|
| Q4_K_M | 4 GB | Mac with 8 GB+ RAM |
| Q5_K_M | 5 GB | Mac with 8 GB+ RAM |
| Q8_0 | 8 GB | Mac with 12 GB+ RAM |
## Tags

reasoning, math, coding, chain-of-thought

Quantized with the corelm-model pipeline by worthdoing on 2026-04-17