# Qwen3.5-27B for hipfire
Pre-quantized Qwen3.5-27B (DeltaNet hybrid) for hipfire, a Rust-native LLM inference engine for AMD RDNA GPUs.
Quantized from Qwen/Qwen3.5-27B.
## Files
| File | Quant | Size | Min VRAM | RX 5700 XT | RX 7900 XTX |
|---|---|---|---|---|---|
| qwen3.5-27b.hf4 | HF4 | 13.32 GB | 16 GB | TBD | 47 tok/s |
| qwen3.5-27b.hf6 | HF6 | 19.92 GB | 24 GB | TBD | — |
| qwen3.5-27b.mq4 | MQ4 ⭐ | 13.95 GB | 16 GB | TBD | 46 tok/s |
Speeds are forward-only tok/s on the listed AMD GPU. ⭐ MQ4 ships with a mandatory byte-exact greedy quality gate (9 reference token streams).
## Usage

```bash
# Install hipfire
curl -L https://raw.githubusercontent.com/Kaden-Schutt/hipfire/master/scripts/install.sh | bash

# Pull and run any variant
hipfire pull qwen3.5:27b       # HF4 (default, fastest)
hipfire pull qwen3.5:27b-mq4   # MQ4 (quality-gated, near-Q8 output)
hipfire pull qwen3.5:27b-hf6   # HF6 (highest quality, ~15% slower)
hipfire run qwen3.5:27b-mq4 "Hello"
```
## Quantization Formats
**HF4 (HFQ4-G256)** — flat 4-bit, 256-weight groups (~0.53 B/w including the per-group scale and zero-point). Best raw tok/s. Same storage layout as Q4_K_M in llama.cpp, but without the K-quant block descriptors.
**HF6 (HFQ6-G256)** — flat 6-bit, 256-weight groups (~0.78 B/w). Highest quality, ~15% slower than HF4. Use this if you have VRAM headroom and want the smallest accuracy loss vs FP16.
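The bytes-per-weight figures above fall out of the group layout. A minimal sketch of the arithmetic, assuming one f32 scale and one f32 zero-point per 256-weight group (an assumption about the on-disk header, but it reproduces the quoted ~0.53 and ~0.78 B/w):

```rust
// Bytes-per-weight for a flat group-quantized format.
// Assumption (not from the hipfire spec): each group of 256 weights
// carries one f32 scale and one f32 zero-point, i.e. 8 bytes of overhead.
fn bytes_per_weight(bits: u32, group_size: u32) -> f64 {
    let packed_bytes = (group_size * bits) as f64 / 8.0; // 4-bit: 128 B, 6-bit: 192 B
    let overhead_bytes = 4.0 + 4.0; // f32 scale + f32 zero-point per group
    (packed_bytes + overhead_bytes) / group_size as f64
}

fn main() {
    println!("HF4: {} B/w", bytes_per_weight(4, 256)); // ~0.53
    println!("HF6: {} B/w", bytes_per_weight(6, 256)); // ~0.78
}
```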
**MQ4 (MagnumQuant 4-bit)** ⭐ — FWHT-rotated 4-bit. Storage layout identical to HF4 (4.25 bits/weight, ~0.53 B/w), but the weights are pre-rotated through a Walsh–Hadamard transform at quantization time, and the input vector is rotated through the same transform on the fly during the GEMV. The rotation flattens outliers, dramatically improving the quantization-error distribution. The result: roughly Q8-grade output quality at Q4 bandwidth.

Every commit that touches kernel or forward-pass code in the hipfire repo is gated on byte-exact MQ4 greedy decoding of 9 reference (model, prompt) pairs — see tests/quality-baselines and scripts/quality-gate.sh. Any silent numerical regression in the forward pass is caught at commit time.
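The rotation at the heart of MQ4 is the classic fast Walsh–Hadamard butterfly. Below is a minimal CPU-side sketch of an orthonormal FWHT; it is illustrative only — hipfire's actual kernel applies the input-side rotation fused into the GEMV on the GPU:

```rust
/// In-place fast Walsh–Hadamard transform, normalized by 1/sqrt(n) so it
/// is orthonormal. Rotating both the weights and the input vector with an
/// orthonormal transform preserves dot products, which is why the GEMV
/// result is unchanged (up to rounding) while outliers get flattened.
fn fwht(x: &mut [f32]) {
    let n = x.len();
    assert!(n.is_power_of_two(), "FWHT needs a power-of-two length");
    let mut h = 1;
    while h < n {
        for block in (0..n).step_by(2 * h) {
            for i in block..block + h {
                let (a, b) = (x[i], x[i + h]);
                x[i] = a + b;     // butterfly: sum
                x[i + h] = a - b; // butterfly: difference
            }
        }
        h *= 2;
    }
    // Normalize so the transform matrix is orthonormal (its own inverse).
    let scale = 1.0 / (n as f32).sqrt();
    for v in x.iter_mut() {
        *v *= scale;
    }
}

fn main() {
    let mut x = vec![1.0f32, 2.0, 3.0, 4.0];
    fwht(&mut x); // -> [5.0, -1.0, -2.0, 0.0]
    fwht(&mut x); // applying it twice recovers [1.0, 2.0, 3.0, 4.0]
    println!("{:?}", x);
}
```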
All formats embed the tokenizer and model config inside the model file, so no separate tokenizer.json download is needed.
## About hipfire
Rust + HIP inference engine for AMD consumer GPUs (RDNA1–RDNA4). No Python in the hot path. The 0.1.4-alpha branch lands a kernel-fusion overhaul that roughly doubles forward speed on Qwen3.5 across the lineup vs the previous release.
- GitHub: Kaden-Schutt/hipfire
- All models: docs/MODELS.md
## License

Model weights are subject to the original Qwen license. The hipfire engine is MIT-licensed.