Qwen3.5-27B-NVFP4

NVFP4-quantized version of Qwen/Qwen3.5-27B, quantized using NVIDIA ModelOpt for efficient serving on NVIDIA Blackwell GPUs with native FP4 tensor core acceleration.

Key Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-27B |
| Quantization | NVFP4 (4-bit floating point with FP8 per-block scales) |
| Quantization Tool | nvidia-modelopt 0.42.0 |
| Calibration | 256 samples from cnn_dailymail (3.0.0, train split) |
| Calibration Sequence Length | 512 tokens |
| Model Size | ~19 GB (down from ~54 GB in BF16) |
| Supported Hardware | NVIDIA Blackwell (SM 120+) — RTX 5090, B200, GB200, etc. |
| Serving Framework | vLLM 0.17.1 with --quantization modelopt |

Architecture

Qwen3.5-27B is a multimodal Vision-Language Model (VLM) with a novel hybrid architecture:

  • 64 total decoder layers with 5120 hidden dimension
  • 48 Gated DeltaNet layers (linear_attention) — a linear attention variant with gated recurrence for efficient long-context processing
  • 16 standard full attention layers (every 4th layer, 0-indexed: 3, 7, 11, 15, ..., 63) — providing global attention capability
  • 27-layer Vision Transformer (ViT) with 1152 hidden dimension for image understanding
  • 262,144 token context length
  • Supports text, image, and video inputs
  • Tool calling and reasoning (thinking) capabilities
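The hybrid layer layout above can be made concrete with a few lines (a sketch; 0-indexed layer numbering assumed from the list above):

```python
# Derive the hybrid decoder layout: 64 layers, with every 4th layer
# (indices 3, 7, ..., 63) using full attention and the rest using
# Gated DeltaNet linear attention.
NUM_LAYERS = 64

full_attention = [i for i in range(NUM_LAYERS) if i % 4 == 3]
linear_attention = [i for i in range(NUM_LAYERS) if i % 4 != 3]

print(len(full_attention))    # 16 full attention layers
print(len(linear_attention))  # 48 DeltaNet layers
print(full_attention[:4])     # [3, 7, 11, 15]
```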

What's Quantized vs Full Precision

Quantized to NVFP4 (606 layers)

| Component | Count | Description |
|---|---|---|
| MLP layers | 192 (64×3) | gate_proj, up_proj, down_proj across all 64 layers |
| DeltaNet projections | 240 (48×5) | in_proj_a, in_proj_b, in_proj_qkv, in_proj_z, out_proj |
| Full attention projections | 64 (16×4) | q_proj, k_proj, v_proj, o_proj |
| Vision encoder linears | 110 | ViT attention + MLP weights (quantized during PTQ, then restored to BF16 in the released checkpoint; see Step 4 below) |

Kept at BF16 (full precision)

| Component | Count | Reason |
|---|---|---|
| lm_head | 1 | Output projection; quantizing it degrades output quality |
| embed_tokens | 1 | Embedding layer (nn.Embedding, not nn.Linear) |
| DeltaNet conv1d weights | 48 | Sensitive state-tracking convolutions in linear attention |
| DeltaNet A_log parameters | 48 | Decay-rate parameters (not nn.Linear) |
| DeltaNet dt_bias parameters | 48 | Timestep-bias parameters (not nn.Linear) |
| Vision encoder (all) | 333 | Kept at BF16 for image-understanding quality |
| LayerNorm weights | 209 | Normalization parameters |
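The per-component counts in the table of quantized layers can be cross-checked with simple arithmetic (names taken from the tables above):

```python
# Cross-check the quantized-layer counts from the tables above.
mlp = 64 * 3        # gate_proj, up_proj, down_proj per decoder layer
deltanet = 48 * 5   # five projections per DeltaNet layer
full_attn = 16 * 4  # q/k/v/o projections per full-attention layer
vision = 110        # ViT attention + MLP linears

print(mlp, deltanet, full_attn)             # 192 240 64
print(mlp + deltanet + full_attn + vision)  # 606 layers quantized to NVFP4
```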

How to Serve with vLLM

pip install vllm==0.17.1

vllm serve berkerdooo/Qwen3.5-27B-NVFP4 \
  --quantization modelopt \
  --gpu-memory-utilization 0.75 \
  --max-model-len -1 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --served-model-name qwen3.5-27b \
  --trust-remote-code

Then query the OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="qwen3.5-27b",
    messages=[{"role": "user", "content": "Explain quantum entanglement in simple terms."}],
    max_tokens=1024,
)
print(response.choices[0].message.content)
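Since the server is launched with --enable-auto-tool-choice and --tool-call-parser qwen3_coder, tool calling uses the standard OpenAI function-calling schema. A sketch of a request body (the get_weather tool is a hypothetical example, not part of the model):

```python
import json

# Hypothetical example tool; the schema follows the standard
# OpenAI function-calling format.
payload = {
    "model": "qwen3.5-27b",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

print(json.dumps(payload, indent=2))
```

POST this to /v1/chat/completions (or pass `tools=` to `client.chat.completions.create`); the qwen3_coder parser converts the model's tool-call output into structured `tool_calls` in the response.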

Important Notes for Serving

This model requires vLLM 0.17.1; it has been tested and verified with that version, and other versions may not work correctly.

  1. vLLM patch required: vLLM's modelopt quantization handler has a hardcoded vision encoder prefix check for "vision_tower" and "vision_model", but Qwen3.5 uses "model.visual". You need to patch vllm/model_executor/layers/quantization/modelopt.py (~line 190):

    # Change:
    if "vision_tower" in prefix or "vision_model" in prefix:
    # To:
    if "vision_tower" in prefix or "vision_model" in prefix or "visual" in prefix:
    

    This ensures the vision encoder loads as unquantized BF16 correctly. Without this patch, vLLM will crash with ValueError: Unsupported model when in features size is not multiple of 16.
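Rather than editing the file by hand, the one-line change can be applied with a small script (a sketch; it assumes the condition appears exactly as shown above):

```python
# Patch vLLM's modelopt handler in place so Qwen3.5's "model.visual"
# vision encoder is treated as an unquantized module.
import importlib.util

OLD = 'if "vision_tower" in prefix or "vision_model" in prefix:'
NEW = 'if "vision_tower" in prefix or "vision_model" in prefix or "visual" in prefix:'

def patch_source(text: str) -> str:
    """Return patched source; a no-op if already patched or not found."""
    if NEW in text:
        return text
    return text.replace(OLD, NEW)

if __name__ == "__main__":
    try:
        spec = importlib.util.find_spec(
            "vllm.model_executor.layers.quantization.modelopt")
    except ModuleNotFoundError:
        spec = None
    if spec and spec.origin:
        with open(spec.origin) as f:
            src = f.read()
        with open(spec.origin, "w") as f:
            f.write(patch_source(src))
        print(f"patched {spec.origin}")
    else:
        print("vllm not installed; nothing to patch")
```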

  2. Blackwell GPUs required: NVFP4 uses native FP4 tensor cores available only on SM 120+ (Blackwell architecture). This will not work on Ampere/Hopper GPUs.

  3. Tensor parallelism: Use --tensor-parallel-size N to split the model across multiple GPUs if needed. At ~19 GB, the model fits on a single 32 GB Blackwell GPU, or can be split across multiple GPUs for higher throughput.

Quantization Process

This model was quantized using NVIDIA ModelOpt's Post-Training Quantization (PTQ) with activation-aware calibration:

Step 1: Load the model

import torch
from transformers import AutoModelForImageTextToText, AutoTokenizer

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.5-27B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-27B", trust_remote_code=True)

Step 2: Calibrate with real data

256 text samples from CNN/DailyMail were passed through the model. During these forward passes, ModelOpt's 2140 inserted quantizers collected activation statistics (min/max ranges per block of 16 elements) to determine optimal FP4 scaling factors for both weights and input activations.

import modelopt.torch.quantization as mtq

device = next(model.parameters()).device

# calib_batches: 256 tokenized cnn_dailymail samples, truncated to 512 tokens
def forward_loop(model):
    for batch in calib_batches:
        model(input_ids=batch["input_ids"].to(device),
              attention_mask=batch["attention_mask"].to(device))

model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

Step 3: Export checkpoint

import torch
from modelopt.torch.export import export_hf_checkpoint

with torch.inference_mode():
    export_hf_checkpoint(model, dtype=torch.bfloat16, export_dir="./qwen35-27b-nvfp4")

Step 4: Fix vision encoder weights

The NVFP4 default config quantizes all nn.Linear layers including the vision encoder. For optimal image understanding quality and TP compatibility, we replaced the quantized vision weights with the original BF16 weights and added "model.visual" to the quantization ignore list.
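The replacement amounts to overwriting every model.visual.* entry of the exported state dict with the original BF16 tensors and dropping the corresponding NVFP4 scale tensors. A minimal sketch of that merge (merge_vision_weights and VISION_PREFIX are illustrative names, not ModelOpt API):

```python
# Illustrative sketch: restore original BF16 vision weights in an
# exported NVFP4 state dict. `quant_sd` is the exported checkpoint's
# state dict, `orig_sd` the original BF16 model's state dict.
VISION_PREFIX = "model.visual."

def merge_vision_weights(quant_sd: dict, orig_sd: dict) -> dict:
    # Drop quantized vision tensors and their NVFP4 scale tensors.
    merged = {k: v for k, v in quant_sd.items()
              if not k.startswith(VISION_PREFIX)}
    # Copy the original BF16 vision tensors back in.
    for k, v in orig_sd.items():
        if k.startswith(VISION_PREFIX):
            merged[k] = v
    return merged
```

Afterwards, "model.visual" is added to the quantization ignore list in the exported config so vLLM skips quantization for the vision tower.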

NVFP4 Format Details

  • Weight encoding: 1 sign bit + 2 exponent bits + 1 mantissa bit (FP4 E2M1), packed as uint8 (2 values per byte)
  • Block scaling: Every 16 elements share a scaling factor stored as FP8 (float8_e4m3fn)
  • Global scaling: Each tensor has an additional float32 global scale (weight_scale_2)
  • Compression: ~3× vs BF16 (19 GB vs 54 GB)
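For reference, the representable FP4 (E2M1) values can be decoded with a few lines (a sketch of the E2M1 encoding, not ModelOpt code):

```python
# Decode a 4-bit E2M1 code (1 sign, 2 exponent, 1 mantissa bit).
def decode_fp4(code: int) -> float:
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:            # subnormal: 0 or 0.5
        mag = 0.5 * man
    else:                   # normal: 2^(exp-1) * (1 + man/2)
        mag = 2.0 ** (exp - 1) * (1.0 + man / 2.0)
    return sign * mag

# The 8 non-negative representable magnitudes:
print(sorted({decode_fp4(c) for c in range(8)}))
# -> [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Storage-wise, a block of 16 FP4 values costs 8 bytes plus 1 byte for the FP8 scale, i.e. 9 bytes versus 32 bytes in BF16 (~3.6× per quantized tensor); the FP32 global scale and the layers kept at BF16 bring the overall checkpoint to roughly 3×.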

Environment

  • Python: 3.12
  • PyTorch: 2.10.0+cu128
  • CUDA: 12.8
  • transformers: 5.3.0.dev0 (from source — required for qwen3_5 model type)
  • nvidia-modelopt: 0.42.0
  • vLLM: 0.17.1
  • Hardware: NVIDIA Blackwell GPU(s)

License

This model inherits the Apache 2.0 license from the base Qwen3.5-27B model.

Citation

If you use this quantized model, please cite the original Qwen3.5 work and NVIDIA ModelOpt:

@misc{qwen3.5-27b-nvfp4,
  title={Qwen3.5-27B-NVFP4: NVFP4 Quantized Qwen3.5-27B},
  author={ares2324},
  year={2026},
  url={https://huggingface.co/ares2324/Qwen3.5-27B-NVFP4}
}