Qwen3.5-27B for hipfire

Pre-quantized Qwen3.5-27B (DeltaNet hybrid) for hipfire, a Rust-native LLM inference engine for AMD RDNA GPUs.

Quantized from Qwen/Qwen3.5-27B.

Files

| File | Quant | Size | Min VRAM | RX 5700 XT | RX 7900 XTX |
|------|-------|------|----------|------------|-------------|
| qwen3.5-27b.hf4 | HF4 | 13.32 GB | 16 GB | TBD | 47 tok/s |
| qwen3.5-27b.hf6 | HF6 | 19.92 GB | 24 GB | TBD | TBD |
| qwen3.5-27b.mq4 | MQ4 ⭐ | 13.95 GB | 16 GB | TBD | 46 tok/s |

Speeds are forward-only tok/s on the listed AMD GPUs. ⭐ MQ4 ships with a mandatory byte-exact greedy quality gate (9 reference token streams).
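As a rough sanity check on the sizes above (my own back-of-envelope arithmetic, not from this card), a ~27B-parameter model at the bits-per-weight listed in the format descriptions below lands close to the table values, assuming the sizes are GiB and ignoring higher-precision embeddings/norm weights:

```python
# Back-of-envelope size check. The parameter count and the assumption
# that overhead-inclusive bit widths are 4.25 / 6.25 bits per weight
# are mine, not stated authoritatively by the card.
params = 27e9          # ~27B parameters (assumed)

hf4_bpw = 4.25         # 4-bit payload + per-group scale/zero ≈ 0.53 B/w
hf6_bpw = 6.25         # 6-bit payload + per-group overhead   ≈ 0.78 B/w

hf4_gib = params * hf4_bpw / 8 / 2**30
hf6_gib = params * hf6_bpw / 8 / 2**30
print(f"HF4 ~ {hf4_gib:.2f} GiB, HF6 ~ {hf6_gib:.2f} GiB")
```

Both come out within a few percent of the listed 13.32 GB and 19.92 GB; the residual gap is plausibly tokenizer/config embedding and tensors kept at higher precision.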

Usage

# Install hipfire
curl -L https://raw.githubusercontent.com/Kaden-Schutt/hipfire/master/scripts/install.sh | bash

# Pull and run any variant
hipfire pull qwen3.5:27b            # HF4 (default — fastest)
hipfire pull qwen3.5:27b-mq4        # MQ4 (quality-gated, near-Q8 output)
hipfire pull qwen3.5:27b-hf6        # HF6 (highest quality, ~15% slower)

hipfire run qwen3.5:27b-mq4 "Hello"

Quantization Formats

  • HF4 (HFQ4-G256) — flat 4-bit, 256-weight groups (~0.53 B/w including per-group scale + zero). Best raw tok/s. Same storage layout as Q4_K_M in llama.cpp but without the K-quant block descriptors.

  • HF6 (HFQ6-G256) — flat 6-bit, 256-weight groups (~0.78 B/w). Highest quality, ~15% slower than HF4. Use this if you have VRAM headroom and want the smallest accuracy loss vs FP16.

  • MQ4 (MagnumQuant 4-bit) ⭐ — FWHT-rotated 4-bit. Storage layout identical to HF4 (4.25 bits per weight, ~0.53 B/w), but the weights are pre-rotated through a Walsh–Hadamard transform at quantization time, and the input vector x is rotated through the same transform on the fly during the GEMV. The rotation flattens outliers, dramatically improving the quantization-error distribution. Result: roughly Q8-grade output quality at Q4 bandwidth.

    Every commit that touches kernel or forward-pass code in the hipfire repo is gated against MQ4 byte-exact greedy decoding of 9 reference (model, prompt) pairs — see tests/quality-baselines and scripts/quality-gate.sh. Any silent numerical regression in the forward pass is caught at commit time.
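To illustrate why the shared rotation works, here is a minimal NumPy sketch (not hipfire's actual kernel; the group size, outlier magnitude, and naive symmetric quantizer are invented for the demo). An orthonormal Walsh–Hadamard rotation applied to both the weight row and the input preserves the dot product exactly, while spreading a single weight outlier across the whole group so a 4-bit grid fits the values far better:

```python
import numpy as np

def fwht(a):
    # Iterative unnormalized Walsh–Hadamard transform (length must be a power of 2).
    a = a.copy()
    h, n = 1, len(a)
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a

def rot(a):
    # Normalizing by sqrt(n) makes the transform an orthonormal rotation.
    return fwht(a) / np.sqrt(len(a))

rng = np.random.default_rng(0)
n = 256                       # one quantization group (assumed size)
w = rng.normal(0.0, 1.0, n)
w[7] = 25.0                   # inject a weight outlier for the demo
x = rng.normal(0.0, 1.0, n)

# Rotating both sides leaves the GEMV contribution unchanged:
assert np.isclose(w @ x, rot(w) @ rot(x))

def q4(v):
    # Naive symmetric 4-bit round-to-nearest over one group (demo only).
    s = np.abs(v).max() / 7
    return np.clip(np.round(v / s), -8, 7) * s

err_plain = np.abs(q4(w) - w).mean()
err_rot   = np.abs(q4(rot(w)) - rot(w)).mean()
print(err_plain, err_rot)     # rotated error is much smaller
```

The outlier inflates the plain group's scale by ~7x, so every other weight is quantized coarsely; after rotation the outlier's energy is spread across all 256 slots and the mean error drops sharply.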

All formats embed the tokenizer and model config inside the model file — no separate tokenizer.json download needed.
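The flat group format behind HF4/HF6 can be sketched as follows (a minimal round-trip demo assuming an asymmetric min/max affine mapping per 256-weight group; the actual HFQ bit-packing and scale precision are not shown here):

```python
import numpy as np

def quant_g256(w, group=256, bits=4):
    # Per-group affine quantization: store q (bits wide) + scale + zero per group.
    w = w.reshape(-1, group)
    lo = w.min(axis=1, keepdims=True)          # "zero" offset per group
    hi = w.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)          # map [lo, hi] onto 0..15 for 4-bit
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequant(q, scale, lo):
    return q * scale + lo

rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.02, 4096)                # stand-in weight row
q, s, z = quant_g256(w)
w_hat = dequant(q, s, z).ravel()
print(np.abs(w - w_hat).max())                 # bounded by scale/2 per group
```

Round-to-nearest guarantees each reconstructed weight is within half a quantization step of the original, which is the error the FWHT rotation in MQ4 then shrinks further by keeping per-group scales small.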

About hipfire

Rust + HIP inference engine for AMD consumer GPUs (RDNA1–RDNA4). No Python in the hot path. The 0.1.4-alpha branch lands a kernel-fusion overhaul that roughly doubles Qwen3.5 forward speed across the lineup compared with the previous release.

License

Model weights subject to the original Qwen license. hipfire engine: MIT.
