# Qwen3.5-27B-NVFP4
An NVFP4-quantized version of Qwen/Qwen3.5-27B, produced with NVIDIA ModelOpt for efficient serving on NVIDIA Blackwell GPUs with native FP4 tensor-core acceleration.
## Key Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-27B |
| Quantization | NVFP4 (4-bit floating point with FP8 per-block scales) |
| Quantization Tool | nvidia-modelopt 0.42.0 |
| Calibration | 256 samples from cnn_dailymail (3.0.0, train split) |
| Calibration Sequence Length | 512 tokens |
| Model Size | ~19 GB (down from ~54 GB in BF16) |
| Supported Hardware | NVIDIA Blackwell (SM 120+) — RTX 5090, B200, GB200, etc. |
| Serving Framework | vLLM 0.17.1 with `--quantization modelopt` |
## Architecture
Qwen3.5-27B is a multimodal Vision-Language Model (VLM) with a hybrid attention architecture:
- 64 total decoder layers with a hidden dimension of 5120
- 48 Gated DeltaNet layers (`linear_attention`) — a linear-attention variant with gated recurrence for efficient long-context processing
- 16 standard full-attention layers (every 4th layer: 3, 7, 11, 15, ..., 63) — providing global attention capability
- 27-layer Vision Transformer (ViT) with a hidden dimension of 1152 for image understanding
- 262,144-token context length
- Supports text, image, and video inputs
- Tool calling and reasoning ("thinking") capabilities
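The interleaving of DeltaNet and full-attention layers described above can be sketched in a few lines of Python (a minimal illustration assuming 0-indexed layers, consistent with the indices 3, 7, 11, ..., 63 listed above):

```python
# Sketch of the hybrid decoder layout described above (0-indexed layers).
NUM_LAYERS = 64

# Every 4th layer (indices 3, 7, 11, ..., 63) uses standard full attention;
# the remaining layers use Gated DeltaNet linear attention.
full_attention_layers = [i for i in range(NUM_LAYERS) if i % 4 == 3]
deltanet_layers = [i for i in range(NUM_LAYERS) if i % 4 != 3]

print(len(full_attention_layers))   # 16 full-attention layers
print(len(deltanet_layers))         # 48 DeltaNet layers
print(full_attention_layers[:4])    # [3, 7, 11, 15]
```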
## What's Quantized vs Full Precision
### Quantized to NVFP4 (606 layers)
| Component | Count | Description |
|---|---|---|
| MLP layers | 192 (64×3) | gate_proj, up_proj, down_proj across all 64 layers |
| DeltaNet projections | 240 (48×5) | in_proj_a, in_proj_b, in_proj_qkv, in_proj_z, out_proj |
| Full attention projections | 64 (16×4) | q_proj, k_proj, v_proj, o_proj |
| Vision encoder linears | 110 | ViT attention + MLP weights |
### Kept at BF16 (full precision)
| Component | Count | Reason |
|---|---|---|
| `lm_head` | 1 | Output projection — quantizing it degrades output quality |
| `embed_tokens` | 1 | Embedding layer (`nn.Embedding`, not `nn.Linear`) |
| DeltaNet `conv1d` weights | 48 | Sensitive state-tracking convolutions in linear attention |
| DeltaNet `A_log` parameters | 48 | Decay-rate parameters (not `nn.Linear`) |
| DeltaNet `dt_bias` parameters | 48 | Timestep-bias parameters (not `nn.Linear`) |
| Vision encoder (all) | 333 | Kept at BF16 for image-understanding quality |
| LayerNorm weights | 209 | Normalization parameters |
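The split in the two tables above can be expressed as a simple name-based filter. The sketch below is illustrative only: the function and the substring patterns are inferred from the tables, not taken from the actual ModelOpt configuration (which operates on module objects rather than parameter names):

```python
def keeps_bf16(param_name: str) -> bool:
    """Return True if a weight stays in BF16 per the tables above.

    Illustrative sketch: matches substrings of Hugging Face-style
    parameter names; the real ModelOpt ignore list is module-based.
    """
    bf16_markers = (
        "lm_head", "embed_tokens",        # output projection and embeddings
        "conv1d", "A_log", "dt_bias",     # DeltaNet state-tracking parameters
        "model.visual",                   # entire vision encoder kept at BF16
        "norm",                           # LayerNorm / RMSNorm weights
    )
    return any(marker in param_name for marker in bf16_markers)

print(keeps_bf16("model.layers.0.linear_attn.conv1d.weight"))  # True
print(keeps_bf16("model.layers.0.mlp.gate_proj.weight"))       # False
```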
## How to Serve with vLLM

```bash
pip install vllm==0.17.1

vllm serve berkerdooo/Qwen3.5-27B-NVFP4 \
  --quantization modelopt \
  --gpu-memory-utilization 0.75 \
  --max-model-len -1 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --served-model-name qwen3.5-27b \
  --trust-remote-code
```
Then query the OpenAI-compatible API:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="qwen3.5-27b",
    messages=[{"role": "user", "content": "Explain quantum entanglement in simple terms."}],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```
## Important Notes for Serving

**vLLM version:** This model requires vLLM 0.17.1; it has been tested and verified only with that version. Other versions may not work correctly.

**vLLM patch required:** vLLM's modelopt quantization handler has a hardcoded vision-encoder prefix check for `"vision_tower"` and `"vision_model"`, but Qwen3.5 uses `"model.visual"`. You need to patch `vllm/model_executor/layers/quantization/modelopt.py` (~line 190):

```python
# Change:
if "vision_tower" in prefix or "vision_model" in prefix:
# To:
if "vision_tower" in prefix or "vision_model" in prefix or "visual" in prefix:
```

This ensures the vision encoder correctly loads as unquantized BF16. Without this patch, vLLM will crash with `ValueError: Unsupported model when in features size is not multiple of 16`.

**Blackwell GPUs required:** NVFP4 uses native FP4 tensor cores available only on SM 120+ (Blackwell architecture). This will not work on Ampere/Hopper GPUs.

**Tensor parallelism:** Use `--tensor-parallel-size N` to split the model across multiple GPUs if needed. At ~19 GB, the model fits on a single 32 GB Blackwell GPU, or it can be split across multiple GPUs for higher throughput.
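As a rough sanity check on the single-GPU claim, the memory budget works out as follows (a back-of-the-envelope sketch using the 32 GB figure and the `--gpu-memory-utilization 0.75` flag from the serve command above; actual vLLM allocation also includes activations and CUDA graph overhead):

```python
# Back-of-the-envelope memory budget for a single 32 GB Blackwell GPU.
gpu_memory_gb = 32.0
utilization = 0.75        # --gpu-memory-utilization 0.75
weights_gb = 19.0         # NVFP4 checkpoint size from the table above

vllm_budget_gb = gpu_memory_gb * utilization   # what vLLM may allocate
kv_cache_gb = vllm_budget_gb - weights_gb      # rough headroom for KV cache

print(f"vLLM budget: {vllm_budget_gb:.1f} GB")       # vLLM budget: 24.0 GB
print(f"KV-cache headroom: ~{kv_cache_gb:.1f} GB")   # KV-cache headroom: ~5.0 GB
```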
## Quantization Process
This model was quantized using NVIDIA ModelOpt's Post-Training Quantization (PTQ) with activation-aware calibration:
### Step 1: Load the model

```python
import torch
from transformers import AutoModelForImageTextToText, AutoTokenizer

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.5-27B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```
### Step 2: Calibrate with real data

256 text samples from CNN/DailyMail were passed through the model. During these forward passes, ModelOpt's 2140 inserted quantizers collected activation statistics (min/max ranges per block of 16 elements) to determine optimal FP4 scaling factors for both weights and input activations.

```python
import modelopt.torch.quantization as mtq

def forward_loop(model):
    for batch in calib_batches:
        model(input_ids=batch["input_ids"].to(device),
              attention_mask=batch["attention_mask"].to(device))

model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```
### Step 3: Export checkpoint

```python
import torch
from modelopt.torch.export import export_hf_checkpoint

with torch.inference_mode():
    export_hf_checkpoint(model, dtype=torch.bfloat16, export_dir="./qwen35-27b-nvfp4")
```
### Step 4: Fix vision encoder weights

The NVFP4 default config quantizes all `nn.Linear` layers, including those in the vision encoder. For optimal image-understanding quality and tensor-parallel compatibility, we replaced the quantized vision weights with the original BF16 weights and added `"model.visual"` to the quantization ignore list.
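A minimal sketch of that weight replacement, operating on plain state dicts (the key prefix comes from the text above, but the helper name and flat-dict shape are assumptions; the actual fix-up worked on exported safetensors shards):

```python
def restore_bf16_vision(quantized_sd: dict, original_sd: dict,
                        prefix: str = "model.visual") -> dict:
    """Drop quantized vision tensors and copy back the BF16 originals.

    Sketch only: both arguments are flat state dicts mapping parameter
    names to tensors (placeholder strings are used here for illustration).
    """
    # Keep everything outside the vision encoder as-is (quantized).
    merged = {k: v for k, v in quantized_sd.items() if not k.startswith(prefix)}
    # Copy the original BF16 vision weights back in (scales are dropped).
    merged.update({k: v for k, v in original_sd.items() if k.startswith(prefix)})
    return merged

quant = {"model.visual.blocks.0.attn.qkv.weight": "fp4",
         "model.visual.blocks.0.attn.qkv.weight_scale": "fp8",
         "model.layers.0.mlp.gate_proj.weight": "fp4"}
orig = {"model.visual.blocks.0.attn.qkv.weight": "bf16"}

merged = restore_bf16_vision(quant, orig)
print(merged["model.visual.blocks.0.attn.qkv.weight"])          # bf16
print("model.visual.blocks.0.attn.qkv.weight_scale" in merged)  # False
```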
## NVFP4 Format Details

- Weight encoding: FP4 E2M1 (1 sign bit, 2-bit exponent, 1-bit mantissa), packed as `uint8` (2 values per byte)
- Block scaling: every 16 elements share a scaling factor stored as FP8 (`float8_e4m3fn`)
- Global scaling: each tensor has an additional `float32` global scale (`weight_scale_2`)
- Compression: ~3× vs BF16 (19 GB vs 54 GB)
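To make the per-block scaling concrete, here is a toy round-trip for a single 16-element block. This is a pure-Python sketch: real NVFP4 kernels pack two 4-bit codes per byte and store the scale as `float8_e4m3fn`, and the amax-to-6.0 scale choice here is an assumption, not ModelOpt's exact calibration rule:

```python
# Representable magnitudes of FP4 E2M1 (plus a sign bit): 0, 0.5, 1, 1.5, 2, 3, 4, 6.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one 16-element block: scale so the largest magnitude maps
    to 6.0, then snap each element to the nearest representable value."""
    assert len(block) == 16
    scale = max(abs(x) for x in block) / 6.0 or 1.0  # avoid a zero scale
    values = [min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
              * (1 if x >= 0 else -1) for x in block]
    return values, scale  # real NVFP4 stores the scale as float8_e4m3fn

def dequantize_block(values, scale):
    return [v * scale for v in values]

block = [0.1 * i for i in range(16)]  # 0.0 .. 1.5
values, scale = quantize_block(block)
recon = dequantize_block(values, scale)
print(max(abs(a - b) for a, b in zip(block, recon)) < 0.25)  # True
```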
## Environment

- Python: 3.12
- PyTorch: 2.10.0+cu128
- CUDA: 12.8
- transformers: 5.3.0.dev0 (from source — required for the `qwen3_5` model type)
- nvidia-modelopt: 0.42.0
- vLLM: 0.17.1
- Hardware: NVIDIA Blackwell GPU(s)
## License
This model inherits the Apache 2.0 license from the base Qwen3.5-27B model.
## Citation

If you use this quantized model, please cite the original Qwen3.5 work and NVIDIA ModelOpt:

```bibtex
@misc{qwen3.5-27b-nvfp4,
  title={Qwen3.5-27B-NVFP4: NVFP4 Quantized Qwen3.5-27B},
  author={ares2324},
  year={2026},
  url={https://huggingface.co/ares2324/Qwen3.5-27B-NVFP4}
}
```