Qwen3.5-27B-NVFP4

NVFP4-quantized version of Qwen/Qwen3.5-27B, quantized using NVIDIA ModelOpt for efficient serving on NVIDIA Blackwell GPUs with native FP4 tensor core acceleration.

Key Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-27B |
| Quantization | NVFP4 (4-bit floating point with FP8 per-block scales) |
| Quantization Tool | nvidia-modelopt 0.42.0 |
| Calibration | 256 samples from cnn_dailymail (3.0.0, train split) |
| Calibration Sequence Length | 512 tokens |
| Model Size | ~19 GB (down from ~54 GB in BF16) |
| Supported Hardware | NVIDIA Blackwell (SM 120+) — RTX 5090, B200, GB200, etc. |
| Serving Framework | vLLM 0.17.1 with --quantization modelopt |

Architecture

Qwen3.5-27B is a multimodal Vision-Language Model (VLM) with a novel hybrid architecture:

  • 64 total decoder layers with 5120 hidden dimension
  • 48 Gated DeltaNet layers (linear_attention) — a linear attention variant with gated recurrence for efficient long-context processing
  • 16 standard full attention layers (every 4th layer, 0-indexed: 3, 7, 11, 15, ..., 63) — providing global attention capability
  • 27-layer Vision Transformer (ViT) with 1152 hidden dimension for image understanding
  • 262,144 token context length
  • Supports text, image, and video inputs
  • Tool calling and reasoning (thinking) capabilities
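The hybrid layer layout above can be made concrete with a few lines (a sketch; 0-indexed layer numbering assumed from the list above):

```python
# Derive the hybrid decoder layout: 64 layers, with every 4th layer
# (indices 3, 7, ..., 63) using full attention and the rest using
# Gated DeltaNet linear attention.
NUM_LAYERS = 64

full_attention = [i for i in range(NUM_LAYERS) if i % 4 == 3]
linear_attention = [i for i in range(NUM_LAYERS) if i % 4 != 3]

print(len(full_attention))    # 16 full attention layers
print(len(linear_attention))  # 48 DeltaNet layers
print(full_attention[:4])     # [3, 7, 11, 15]
```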

What's Quantized vs Full Precision

Quantized to NVFP4 (606 layers)

| Component | Count | Description |
|---|---|---|
| MLP layers | 192 (64×3) | gate_proj, up_proj, down_proj across all 64 layers |
| DeltaNet projections | 240 (48×5) | in_proj_a, in_proj_b, in_proj_qkv, in_proj_z, out_proj |
| Full attention projections | 64 (16×4) | q_proj, k_proj, v_proj, o_proj |
| Vision encoder linears | 110 | ViT attention + MLP weights (quantized during PTQ, then restored to BF16 in the released checkpoint; see Step 4 below) |

Kept at BF16 (full precision)

| Component | Count | Reason |
|---|---|---|
| lm_head | 1 | Output projection; quantizing it degrades output quality |
| embed_tokens | 1 | Embedding layer (nn.Embedding, not nn.Linear) |
| DeltaNet conv1d weights | 48 | Sensitive state-tracking convolutions in linear attention |
| DeltaNet A_log parameters | 48 | Decay-rate parameters (not nn.Linear) |
| DeltaNet dt_bias parameters | 48 | Timestep-bias parameters (not nn.Linear) |
| Vision encoder (all) | 333 | Kept at BF16 for image-understanding quality |
| LayerNorm weights | 209 | Normalization parameters |
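The per-component counts in the table of quantized layers can be cross-checked with simple arithmetic (names taken from the tables above):

```python
# Cross-check the quantized-layer counts from the tables above.
mlp = 64 * 3        # gate_proj, up_proj, down_proj per decoder layer
deltanet = 48 * 5   # five projections per DeltaNet layer
full_attn = 16 * 4  # q/k/v/o projections per full-attention layer
vision = 110        # ViT attention + MLP linears

print(mlp, deltanet, full_attn)             # 192 240 64
print(mlp + deltanet + full_attn + vision)  # 606 layers quantized to NVFP4
```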

How to Serve with vLLM

pip install vllm==0.17.1

vllm serve berkerdooo/Qwen3.5-27B-NVFP4 \
  --quantization modelopt \
  --gpu-memory-utilization 0.75 \
  --max-model-len -1 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --served-model-name qwen3.5-27b \
  --trust-remote-code

Then query the OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="qwen3.5-27b",
    messages=[{"role": "user", "content": "Explain quantum entanglement in simple terms."}],
    max_tokens=1024,
)
print(response.choices[0].message.content)
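Since the server is launched with --enable-auto-tool-choice and --tool-call-parser qwen3_coder, tool calling uses the standard OpenAI function-calling schema. A sketch of a request body (the get_weather tool is a hypothetical example, not part of the model):

```python
import json

# Hypothetical example tool; the schema follows the standard
# OpenAI function-calling format.
payload = {
    "model": "qwen3.5-27b",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

print(json.dumps(payload, indent=2))
```

POST this to /v1/chat/completions (or pass `tools=` to `client.chat.completions.create`); the qwen3_coder parser converts the model's tool-call output into structured `tool_calls` in the response.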

Important Notes for Serving

This model requires vLLM 0.17.1; it has been tested and verified with that version, and other versions may not work correctly.

  1. vLLM patch required: vLLM's modelopt quantization handler has a hardcoded vision encoder prefix check for "vision_tower" and "vision_model", but Qwen3.5 uses "model.visual". You need to patch vllm/model_executor/layers/quantization/modelopt.py (~line 190):

    # Change:
    if "vision_tower" in prefix or "vision_model" in prefix:
    # To:
    if "vision_tower" in prefix or "vision_model" in prefix or "visual" in prefix:
    

    This ensures the vision encoder loads as unquantized BF16 correctly. Without this patch, vLLM will crash with ValueError: Unsupported model when in features size is not multiple of 16.
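Rather than editing the file by hand, the one-line change can be applied with a small script (a sketch; it assumes the condition appears exactly as shown above):

```python
# Patch vLLM's modelopt handler in place so Qwen3.5's "model.visual"
# vision encoder is treated as an unquantized module.
import importlib.util

OLD = 'if "vision_tower" in prefix or "vision_model" in prefix:'
NEW = 'if "vision_tower" in prefix or "vision_model" in prefix or "visual" in prefix:'

def patch_source(text: str) -> str:
    """Return patched source; a no-op if already patched or not found."""
    if NEW in text:
        return text
    return text.replace(OLD, NEW)

if __name__ == "__main__":
    try:
        spec = importlib.util.find_spec(
            "vllm.model_executor.layers.quantization.modelopt")
    except ModuleNotFoundError:
        spec = None
    if spec and spec.origin:
        with open(spec.origin) as f:
            src = f.read()
        with open(spec.origin, "w") as f:
            f.write(patch_source(src))
        print(f"patched {spec.origin}")
    else:
        print("vllm not installed; nothing to patch")
```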

  2. Blackwell GPUs required: NVFP4 uses native FP4 tensor cores available only on SM 120+ (Blackwell architecture). This will not work on Ampere/Hopper GPUs.

  3. Tensor parallelism: Use --tensor-parallel-size N to split the model across multiple GPUs if needed. At ~19 GB, the model fits on a single 32 GB Blackwell GPU, or can be split across multiple GPUs for higher throughput.

Quantization Process

This model was quantized using NVIDIA ModelOpt's Post-Training Quantization (PTQ) with activation-aware calibration:

Step 1: Load the model

import torch
from transformers import AutoModelForImageTextToText, AutoTokenizer

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.5-27B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-27B", trust_remote_code=True)

Step 2: Calibrate with real data

256 text samples from CNN/DailyMail were passed through the model. During these forward passes, ModelOpt's 2140 inserted quantizers collected activation statistics (min/max ranges per block of 16 elements) to determine optimal FP4 scaling factors for both weights and input activations.

import modelopt.torch.quantization as mtq

device = next(model.parameters()).device

# calib_batches: 256 tokenized cnn_dailymail samples, truncated to 512 tokens
def forward_loop(model):
    for batch in calib_batches:
        model(input_ids=batch["input_ids"].to(device),
              attention_mask=batch["attention_mask"].to(device))

model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

Step 3: Export checkpoint

import torch
from modelopt.torch.export import export_hf_checkpoint

with torch.inference_mode():
    export_hf_checkpoint(model, dtype=torch.bfloat16, export_dir="./qwen35-27b-nvfp4")

Step 4: Fix vision encoder weights

The NVFP4 default config quantizes all nn.Linear layers including the vision encoder. For optimal image understanding quality and TP compatibility, we replaced the quantized vision weights with the original BF16 weights and added "model.visual" to the quantization ignore list.
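The replacement amounts to overwriting every model.visual.* entry of the exported state dict with the original BF16 tensors and dropping the corresponding NVFP4 scale tensors. A minimal sketch of that merge (merge_vision_weights and VISION_PREFIX are illustrative names, not ModelOpt API):

```python
# Illustrative sketch: restore original BF16 vision weights in an
# exported NVFP4 state dict. `quant_sd` is the exported checkpoint's
# state dict, `orig_sd` the original BF16 model's state dict.
VISION_PREFIX = "model.visual."

def merge_vision_weights(quant_sd: dict, orig_sd: dict) -> dict:
    # Drop quantized vision tensors and their NVFP4 scale tensors.
    merged = {k: v for k, v in quant_sd.items()
              if not k.startswith(VISION_PREFIX)}
    # Copy the original BF16 vision tensors back in.
    for k, v in orig_sd.items():
        if k.startswith(VISION_PREFIX):
            merged[k] = v
    return merged
```

Afterwards, "model.visual" is added to the quantization ignore list in the exported config so vLLM skips quantization for the vision tower.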

NVFP4 Format Details

  • Weight encoding: 1 sign bit + 2 exponent bits + 1 mantissa bit (FP4 E2M1), packed as uint8 (2 values per byte)
  • Block scaling: Every 16 elements share a scaling factor stored as FP8 (float8_e4m3fn)
  • Global scaling: Each tensor has an additional float32 global scale (weight_scale_2)
  • Compression: ~3× vs BF16 (19 GB vs 54 GB)
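For reference, the representable FP4 (E2M1) values can be decoded with a few lines (a sketch of the E2M1 encoding, not ModelOpt code):

```python
# Decode a 4-bit E2M1 code (1 sign, 2 exponent, 1 mantissa bit).
def decode_fp4(code: int) -> float:
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:            # subnormal: 0 or 0.5
        mag = 0.5 * man
    else:                   # normal: 2^(exp-1) * (1 + man/2)
        mag = 2.0 ** (exp - 1) * (1.0 + man / 2.0)
    return sign * mag

# The 8 non-negative representable magnitudes:
print(sorted({decode_fp4(c) for c in range(8)}))
# -> [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Storage-wise, a block of 16 FP4 values costs 8 bytes plus 1 byte for the FP8 scale, i.e. 9 bytes versus 32 bytes in BF16 (~3.6× per quantized tensor); the FP32 global scale and the layers kept at BF16 bring the overall checkpoint to roughly 3×.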

Environment

  • Python: 3.12
  • PyTorch: 2.10.0+cu128
  • CUDA: 12.8
  • transformers: 5.3.0.dev0 (from source — required for qwen3_5 model type)
  • nvidia-modelopt: 0.42.0
  • vLLM: 0.17.1
  • Hardware: NVIDIA Blackwell GPU(s)

License

This model inherits the Apache 2.0 license from the base Qwen3.5-27B model.

Citation

If you use this quantized model, please cite the original Qwen3.5 work and NVIDIA ModelOpt:

@misc{qwen3.5-27b-nvfp4,
  title={Qwen3.5-27B-NVFP4: NVFP4 Quantized Qwen3.5-27B},
  author={ares2324},
  year={2026},
  url={https://huggingface.co/ares2324/Qwen3.5-27B-NVFP4}
}