Qwen3-8B – SWAN Mixed-Precision (6-bit avg)

This is Qwen3-8B quantized with SWAN (Statistical Weight Analysis for N-bit allocation), a data-free, per-tensor mixed-precision quantization method for MLX on Apple Silicon.

Key Features

  • Data-free quantization: no calibration dataset required; bit widths are derived from weight statistics alone
  • Per-tensor bit allocation: each tensor is assigned 2, 4, 8, or 16 bits based on sensitivity analysis
  • MLX native: ready for inference on Apple Silicon via mlx_lm

Results

| Metric                  | BF16    | SWAN (this model) | Uniform 4-bit | SWAN Δ vs BF16   |
|-------------------------|---------|-------------------|---------------|------------------|
| PPL (WikiText-2)        | 9.727   | 10.097            | 10.249        | +3.8%            |
| ARC-Challenge (25-shot) | 44.62%  | 43.43%            | 42.83%        | -1.2 pp          |
| HellaSwag (10-shot)     | 60.04%  | 58.16%            | 58.14%        | -1.9 pp          |
| Model size              | 15.3 GB | 6.1 GB            | 4.1 GB        | 2.5x compression |
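The compression figures in the last row follow directly from the sizes in the table:

```python
# Sizes from the table above, in GB.
bf16_gb, swan_gb, uniform4_gb = 15.3, 6.1, 4.1

print(f"SWAN vs BF16:          {bf16_gb / swan_gb:.1f}x")      # 2.5x
print(f"Uniform 4-bit vs BF16: {bf16_gb / uniform4_gb:.1f}x")  # 3.7x
```

SWAN trades about 2 GB of extra size against uniform 4-bit for the lower perplexity shown above.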

Usage

pip install mlx-lm

# Generate text
python -m mlx_lm.generate \
    --model baa-ai/Qwen3-8B-SWAN-6bit \
    --prompt "Hello, how are you?"

# Interactive chat
python -m mlx_lm.chat --model baa-ai/Qwen3-8B-SWAN-6bit

Quantization Details

  • Method: SWAN v3 (hybrid normalization + optimized thresholds)
  • Average bits: ~5.82 bits per parameter
  • Base precision: 4-bit with selective 8-bit for sensitive layers
  • Sensitive layers: Early MLP layers, select attention projections
  • Hardware: Quantized on Apple M2 Ultra 192GB
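The reported ~5.82-bit average is a parameter-weighted mean over per-tensor bit widths. The sketch below shows the calculation; the tensor names, sizes, and bit assignments are illustrative placeholders, not this model's actual allocation:

```python
# Each entry: (tensor name, parameter count, assigned bit width).
# Values below are illustrative, not the model's real allocation.
tensors = [
    ("model.layers.0.mlp.down_proj",     4096 * 12288, 8),  # sensitive early MLP
    ("model.layers.0.self_attn.q_proj",  4096 * 4096,  8),  # sensitive attn proj
    ("model.layers.20.mlp.up_proj",      4096 * 12288, 4),
    ("model.layers.30.self_attn.o_proj", 4096 * 4096,  4),
]

total_params = sum(n for _, n, _ in tensors)
avg_bits = sum(n * b for _, n, b in tensors) / total_params
print(round(avg_bits, 2))  # 6.0 for this toy split
```

With half the parameters at 8-bit and half at 4-bit this lands at exactly 6.0; the real allocation is skewed further toward 4-bit, giving the ~5.82 average.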

About SWAN

SWAN computes four sensitivity metrics per tensor: SVD spectral concentration, excess kurtosis, output noise amplification, and reconstruction error proxy. These are combined into a composite score that drives automatic bit-width allocation, with no calibration data required.
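The scoring pipeline can be sketched as follows. This is an illustrative reimplementation of just two of the four metrics (excess kurtosis and SVD spectral concentration); the composite weighting and the thresholds are placeholders, not SWAN's tuned values:

```python
import numpy as np

def swan_style_metrics(W):
    """Two of SWAN's four per-tensor statistics (illustrative only)."""
    x = W.astype(np.float64).ravel()
    x -= x.mean()
    # Excess kurtosis: heavy-tailed weight distributions lose more outlier
    # mass at low bit widths, so higher kurtosis implies higher sensitivity.
    kurt = np.mean(x**4) / np.mean(x**2) ** 2 - 3.0
    # SVD spectral concentration: share of energy in the top 10% of singular
    # values; tensors dominated by a few directions are easier to damage.
    s = np.linalg.svd(W.astype(np.float64), compute_uv=False)
    energy = s**2
    k = max(1, len(s) // 10)
    conc = energy[:k].sum() / energy.sum()
    return kurt, conc

def allocate_bits(score, low=0.3, high=0.7):
    """Map a composite sensitivity score in [0, 1] to a bit width.
    Thresholds here are placeholders, not SWAN's optimized values."""
    if score < low:
        return 2
    if score < high:
        return 4
    return 8

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)
kurt, conc = swan_style_metrics(W)
# Toy composite: equal-weight average of the (roughly 0-1 scaled) metrics.
score = 0.5 * min(max(kurt / 10.0, 0.0), 1.0) + 0.5 * conc
print(allocate_bits(score))
```

A near-Gaussian random tensor like this one scores low on both metrics and gets a low bit width; real sensitive layers (heavy-tailed or spectrally concentrated) score higher and are promoted to 8-bit.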

  • Paper: SWAN: Data-Free Mixed-Precision Quantization for LLMs via Multi-Metric Sensitivity Analysis (Black Sheep AI Research, 2026)
