Author: Simon-Pierre Boucher

DeepSeek-R1-Distill-Qwen-7B - GGUF Quantized by worthdoing

Quantized for local Mac inference (Apple Silicon / Metal) by worthdoing

About

This is a GGUF quantized version of DeepSeek-R1-Distill-Qwen-7B, optimized for running locally on Apple Silicon Macs with llama.cpp, Ollama, or LM Studio.

Description

DeepSeek R1's reasoning abilities distilled into the Qwen 7B base model: a chain-of-thought model well suited to math, coding, and general reasoning.

Available Quantizations

Each file is listed with its quantization type, approximate bits per weight (BPW), and on-disk size:

  • deepseek-r1-distill-qwen-7b-Q4_K_M-worthdoing.gguf (Q4_K_M, ~4.58 BPW, ~3.7 GB): Recommended. Best quality/size ratio.
  • deepseek-r1-distill-qwen-7b-Q5_K_M-worthdoing.gguf (Q5_K_M, ~5.33 BPW, ~4.3 GB): Higher quality, still fast.
  • deepseek-r1-distill-qwen-7b-Q8_0-worthdoing.gguf (Q8_0, ~7.96 BPW, ~6.5 GB): Near-original quality.
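
If you do not already have one of these files locally, a single quant can be fetched with the huggingface-cli tool (assuming the huggingface_hub package is installed; the repository ID matches this model card):

# Download only the Q4_K_M file into the current directory
huggingface-cli download worthdoing/DeepSeek-R1-Distill-Qwen-7B-GGUF \
  deepseek-r1-distill-qwen-7b-Q4_K_M-worthdoing.gguf --local-dir .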

How to Use

With Ollama

# Create a Modelfile
cat > Modelfile <<'MODELEOF'
FROM ./deepseek-r1-distill-qwen-7b-Q4_K_M-worthdoing.gguf
MODELEOF

ollama create deepseek-r1-distill-qwen-7b -f Modelfile
ollama run deepseek-r1-distill-qwen-7b
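
The Modelfile can optionally pin inference parameters as well; the values below are illustrative choices, not settings shipped with this quantization:

# Optional: Modelfile with explicit inference parameters (illustrative values)
cat > Modelfile <<'MODELEOF'
FROM ./deepseek-r1-distill-qwen-7b-Q4_K_M-worthdoing.gguf
PARAMETER temperature 0.6
PARAMETER num_ctx 8192
MODELEOF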

With llama.cpp

llama-cli -m deepseek-r1-distill-qwen-7b-Q4_K_M-worthdoing.gguf -p "Your prompt here" -ngl 99
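
The -ngl 99 flag requests that all model layers be offloaded to the GPU (Metal on Apple Silicon). llama.cpp also includes llama-server, which exposes a local HTTP API for the same GGUF file; a minimal sketch (port and context size are arbitrary examples) looks like this:

# Serve the model locally with full GPU offload (-ngl 99); -c sets the context window
llama-server -m deepseek-r1-distill-qwen-7b-Q4_K_M-worthdoing.gguf -ngl 99 -c 4096 --port 8080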

With LM Studio

  1. Download the GGUF file
  2. Open LM Studio -> My Models -> Import
  3. Select the GGUF file and start chatting

Quantization Method

Our quantization pipeline (corelm-model v1.0) follows a multi-step process designed to preserve model quality and keep the output compatible with standard GGUF runtimes:

Step 1 — Download & Validation

  • Model weights are downloaded from HuggingFace Hub in SafeTensors format (.safetensors)
  • Legacy formats (.bin, .pt) are excluded to ensure clean, verified weights
  • Tokenizer, configuration, and all metadata are preserved
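
To make this step concrete, a download restricted to SafeTensors weights and JSON configuration could look like the sketch below (the exact corelm-model commands may differ; deepseek-ai/DeepSeek-R1-Distill-Qwen-7B is assumed to be the upstream repository):

# Fetch SafeTensors weights plus tokenizer/config JSON, skipping legacy .bin/.pt files
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --include "*.safetensors" "*.json" \
  --local-dir DeepSeek-R1-Distill-Qwen-7B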

Step 2 — Conversion to GGUF F16 Baseline

  • The original model is converted to GGUF format at FP16 precision using convert_hf_to_gguf.py from llama.cpp
  • This lossless baseline preserves the full original model quality
  • Architecture-specific tensors (attention, FFN, embeddings, MoE routing) are mapped to their GGUF equivalents
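
For reference, a typical invocation of the llama.cpp conversion script (paths are illustrative) is:

# Convert the SafeTensors checkpoint to a lossless F16 GGUF baseline
python convert_hf_to_gguf.py DeepSeek-R1-Distill-Qwen-7B \
  --outtype f16 \
  --outfile deepseek-r1-distill-qwen-7b-F16.gguf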

Step 3 — K-Quant Quantization

  • The F16 baseline is quantized using llama-quantize with k-quant methods
  • K-quants use a mixed-precision approach: more important layers (attention, output) retain higher precision, while less sensitive layers (FFN) are compressed more aggressively
  • Each quantization level offers a different quality/size tradeoff:
    • Q4_K_M (~4.58 bits per weight): Mixed 4/5-bit. Attention & output layers use Q5_K, FFN layers use Q4_K. Best balance of quality and size.
    • Q5_K_M (~5.33 bits per weight): Mixed 5/6-bit. Attention & output layers use Q6_K, FFN layers use Q5_K. Higher quality with moderate size increase.
    • Q8_0 (~7.96 bits per weight): Uniform 8-bit. All layers quantized to 8-bit. Near-lossless quality, largest file size.
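
As an illustration of this step (file names match the ones above; the exact pipeline invocation may differ), quantizing the F16 baseline with llama-quantize looks like:

# Produce the Q4_K_M file from the F16 baseline (repeat with Q5_K_M / Q8_0 for the others)
llama-quantize deepseek-r1-distill-qwen-7b-F16.gguf \
  deepseek-r1-distill-qwen-7b-Q4_K_M-worthdoing.gguf Q4_K_M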

Step 4 — Metadata Injection

  • Custom metadata is embedded directly in each GGUF file:
    • general.quantized_by: worthdoing
    • general.quantization_version: corelm-1.0
  • This ensures full traceability and provenance of every quantized file
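
One way to verify the embedded metadata after the fact (a verification sketch, not how the pipeline writes it) is to dump the GGUF key/value pairs, for example with the gguf-dump helper from the gguf Python package; llama.cpp also prints every metadata key when it loads the model:

# Inspect embedded metadata (assumes: pip install gguf)
gguf-dump deepseek-r1-distill-qwen-7b-Q4_K_M-worthdoing.gguf | grep -i quantized_by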

Tools & Environment

  • llama.cpp: Used for both conversion and quantization — the industry-standard open-source LLM inference engine
  • Target platform: Apple Silicon Macs (M1/M2/M3/M4) with Metal GPU acceleration
  • Inference runtimes: Compatible with llama.cpp, Ollama, LM Studio, koboldcpp, and any GGUF-compatible runtime
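
A quick, informal way to confirm that the Metal backend is actually in use is to check the llama.cpp load log:

# The startup log should mention the Metal backend when GPU offload is active
llama-cli -m deepseek-r1-distill-qwen-7b-Q4_K_M-worthdoing.gguf -p "Hello" -n 16 -ngl 99 2>&1 | grep -i metal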

Recommended Hardware

Minimum RAM and recommended machine for each quant:

  • Q4_K_M: 4 GB minimum RAM; recommended: Mac with 8 GB+ RAM
  • Q5_K_M: 5 GB minimum RAM; recommended: Mac with 8 GB+ RAM
  • Q8_0: 8 GB minimum RAM; recommended: Mac with 12 GB+ RAM
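
To check how much physical RAM a given Mac has before picking a quant, the standard macOS sysctl tool can be used:

# Print installed RAM in GB (hw.memsize is reported in bytes)
sysctl -n hw.memsize | awk '{printf "%.0f GB\n", $1/1024/1024/1024}'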

Tags

reasoning, math, coding, chain-of-thought


Quantized with corelm-model pipeline by worthdoing on 2026-04-17
