# Fast-dVLM (3B) – W8A8 FP8 Quantized
[Paper] [Project Page] [Code] [BF16 Base]
## Introduction
This repository hosts the W8A8 FP8 quantized version of `Fast_dVLM_3B`, produced via SmoothQuant offline calibration.
Both the language tower and the vision encoder are quantized to FP8 (E4M3) weights with per-channel static scales and per-token dynamic FP8 activations. Combined with SGLang block-diffusion serving, this checkpoint reaches 350 TPS (a 6.18× speedup over the AR baseline) while keeping the MMMU-Pro-V score within 0.3 points of the BF16 result.
## Key Highlights
- 6.18× speedup over the Qwen2.5-VL-3B AR baseline (350 TPS vs. 56.7 TPS).
- Near-lossless quality: 23.8 MMMU-Pro-V vs. 24.1 for the BF16 block-diffusion model.
- Full-model FP8: both the language and vision towers are quantized; only norms, embeddings, and `lm_head` remain in BF16.
## Model Overview
| Property | Value |
|---|---|
| Type | Block Diffusion Vision-Language Model (FP8 quantized) |
| Base Model | Efficient-Large-Model/Fast_dVLM_3B |
| Quantization | SmoothQuant W8A8 FP8 (E4M3), per-channel static weight scales / per-token dynamic activation scales |
| Calibration | 512 samples × 1024 tokens on C4, SmoothQuant α=0.5 |
| Text Layers | 36 |
| Vision Depth | 32 |
| Text Hidden Size | 2048 |
| Attention Heads | 16 (Q), 2 (KV, GQA) |
| Block Diffusion Size | 32 |
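The SmoothQuant calibration listed above migrates activation outliers into the weights before quantization. A minimal NumPy sketch of that smoothing step, using the formula from the SmoothQuant paper with α=0.5; the arrays and function name are illustrative, not taken from this repo:

```python
# Illustrative SmoothQuant smoothing step (alpha = 0.5, as used for this
# checkpoint). act_absmax / weight_absmax stand in for per-channel
# calibration statistics; the values are made up for demonstration.
import numpy as np

def smooth_scales(act_absmax, weight_absmax, alpha=0.5):
    # s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)  (per input channel j)
    return (act_absmax ** alpha) / (weight_absmax ** (1 - alpha))

act_absmax = np.array([8.0, 0.5, 4.0])      # activation ranges per channel
weight_absmax = np.array([0.5, 2.0, 1.0])   # weight ranges per channel
s = smooth_scales(act_absmax, weight_absmax)

# Activations are divided by s and weights multiplied by s, so X @ W is
# mathematically unchanged while the activation range is flattened,
# making per-token FP8 activation quantization far less lossy.
print(act_absmax / s)
print(weight_absmax * s)
```

With α=0.5 the smoothed activation and weight ranges become equal per channel, which is the balance point between the two quantization difficulties.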
Hardware requirement: NVIDIA GPU with SM89+ (Compute Capability ≥ 8.9), e.g. RTX 4090, L40, H100, H200. FP8 tensor cores are not available on A100 (SM80) or older.
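A quick programmatic check of this requirement can be sketched as follows; the helper name is ours, and the logic is kept framework-free (with PyTorch you would obtain the tuple via `torch.cuda.get_device_capability()`):

```python
# Sketch of an FP8-capability gate matching the note above.
def capability_supports_fp8(major: int, minor: int) -> bool:
    # FP8 (E4M3) tensor cores require SM89+ (Ada/Hopper and newer);
    # A100 reports (8, 0) and lacks them.
    return (major, minor) >= (8, 9)

print(capability_supports_fp8(8, 9))  # RTX 4090 / L40
print(capability_supports_fp8(9, 0))  # H100 / H200
print(capability_supports_fp8(8, 0))  # A100
```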
## Quickstart (SGLang)
Load this checkpoint with the customized SGLang shipped in the Fast-dLLM repo:
```bash
# Install the customized SGLang (one-time)
git clone https://github.com/NVlabs/Fast-dLLM
cd Fast-dLLM/fast_dvlm/sglang/python
pip install -e .

# Run the chatbot with FP8 quantization
cd ../..
python run_chatbot_sglang.py \
    --algorithm spec \
    --model-path Sensen02/Fast_dVLM_3B_W8A8_FP8 \
    --quantization w8a8_fp8 \
    --prompt "Describe this image." \
    --image path/to/image.jpg
```
SGLang reads the `quantization_config` in `config.json` automatically:
"quantization_config": {
"quant_method": "w8a8_fp8",
"is_dynamic": false,
"ignore": []
}
The `ignore` list is empty because every linear layer, including those in the vision encoder, is FP8 quantized in this release.
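For intuition, here is a toy NumPy sketch of the per-channel static weight scaling described above, assuming the standard E4M3 maximum finite value of 448 and using clipping as a stand-in for a real FP8 cast (which would also round the mantissa):

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_weight_scales(w):
    """w: (out_features, in_features) full-precision weight matrix.
    Returns values mapped into the E4M3 range plus per-channel scales."""
    absmax = np.abs(w).max(axis=1, keepdims=True)  # per output channel
    scale = absmax / E4M3_MAX                      # static per-channel scale
    q = np.clip(w / scale, -E4M3_MAX, E4M3_MAX)    # real kernels cast to FP8
    return q, scale.squeeze(1)

w = np.array([[0.5, -2.0], [100.0, 300.0]])
q, scale = fp8_weight_scales(w)
# Dequantization q * scale[:, None] recovers w up to FP8 rounding error,
# which this sketch omits.
```

Because the scales are computed offline per output channel, large-magnitude rows (like the second one here) do not force small-magnitude rows to lose precision.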
## Benchmark Results

### Quality (MMMU-Pro-V)
| Variant | MMMU-Pro-V |
|---|---|
| Qwen2.5-VL-3B (AR baseline) | 26.3 |
| Fast-dVLM (MDM, τ=0.9) | 21.4 |
| Fast-dVLM (spec.) – BF16 | 24.1 |
| Fast-dVLM (spec.) – W8A8 FP8 (this repo) | 23.8 |
### Inference Acceleration
| Setting | MMMU-Pro-V | TPS | Speedup |
|---|---|---|---|
| AR baseline | 26.3 | 56.7 | 1.00x |
| Fast-dVLM (MDM, τ=0.9) | 21.4 | 82.2 | 1.45x |
| + Spec. decoding (linear) | 24.6 | 112.7 | 1.98x |
| + SGLang serving | 24.1 | 319.0 | 5.63x |
| + SmoothQuant-W8A8 (FP8) | 23.8 | 350.3 | 6.18x |
## Citation
If you use Fast-dVLM in your research, please cite:
```bibtex
@misc{wu2026fastdvlmefficientblockdiffusionvlm,
  title={Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM},
  author={Chengyue Wu and Shiyi Lan and Yonggan Fu and Sensen Gao and Jin Wang and Jincheng Yu and Jose M. Alvarez and Pavlo Molchanov and Ping Luo and Song Han and Ligeng Zhu and Enze Xie},
  year={2026},
  eprint={2604.06832},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.06832},
}
```
## License
Released under Apache 2.0, following the base Qwen2.5-VL license.