Fast-dVLM (3B) β€” W8A8 FP8 Quantized

[Paper] [Project Page] [Code] [BF16 Base]

Introduction

This repository hosts the W8A8 FP8 quantized version of Fast_dVLM_3B, produced via SmoothQuant offline calibration.

Both the language tower and the vision encoder are quantized to FP8 (E4M3) weights with per-channel static scales and per-token dynamic FP8 activations. Combined with SGLang block-diffusion serving, this checkpoint reaches 350 TPS (6.18× over the AR baseline) while staying within 0.3 points of the BF16 result on MMMU-Pro-V (23.8 vs 24.1).
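For intuition, the per-token dynamic activation path can be sketched in a few lines of PyTorch (assuming torch >= 2.1 for torch.float8_e4m3fn). This is an illustrative sketch, not the fused kernel the customized SGLang runtime actually uses:

import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_per_token_fp8(x: torch.Tensor):
    # x: [num_tokens, hidden] BF16/FP32 activations.
    # One scale per token (row), computed dynamically at inference time.
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_E4M3_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale  # dequantize as x_fp8.to(torch.float32) * scale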

Key Highlights

  • 6.18× Speedup over the Qwen2.5-VL-3B AR baseline (350 TPS vs 56.7 TPS).
  • Near-Lossless Quality: 23.8 vs 24.1 on MMMU-Pro-V relative to the BF16 block-diffusion model.
  • Full-Model FP8: Language and vision towers are both quantized; only norms, embeddings, and lm_head remain in BF16 (see the snippet below).
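You can confirm the BF16 carve-out by dumping tensor dtypes from the downloaded checkpoint; a minimal sketch with safetensors (the model.safetensors filename is an assumption about the repo layout, and a sharded checkpoint would need a loop over shards):

from collections import Counter
from safetensors import safe_open

dtypes = Counter()
with safe_open("model.safetensors", framework="pt") as f:
    for name in f.keys():
        dtypes[str(f.get_tensor(name).dtype)] += 1
# Expect float8_e4m3fn for linear weights; bfloat16 for norms,
# embeddings, and lm_head.
print(dtypes)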

Model Overview

| Property | Value |
|----------|-------|
| Type | Block Diffusion Vision-Language Model (FP8 quantized) |
| Base Model | Efficient-Large-Model/Fast_dVLM_3B |
| Quantization | SmoothQuant W8A8 FP8 (E4M3), per-channel static weight / per-token dynamic activation |
| Calibration | 512 samples × 1024 tokens on C4, SmoothQuant α=0.5 |
| Text Layers | 36 |
| Vision Depth | 32 |
| Text Hidden Size | 2048 |
| Attention Heads | 16 (Q), 2 (KV, GQA) |
| Block Diffusion Size | 32 |
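The Calibration row refers to SmoothQuant's offline smoothing step. A hedged sketch of that transform with α=0.5 follows; function and variable names are illustrative, not the repo's actual calibration code:

import torch

def smooth_linear(weight: torch.Tensor, act_amax: torch.Tensor, alpha: float = 0.5):
    # weight: [out_features, in_features]
    # act_amax: [in_features] per-channel max |activation| collected during
    # calibration (here: 512 samples x 1024 tokens on C4).
    w_amax = weight.abs().amax(dim=0)  # per-input-channel weight max
    # s_j = max|X_j|^alpha / max|W_j|^(1 - alpha): migrates quantization
    # difficulty from activations into the weights.
    s = (act_amax.pow(alpha) / w_amax.pow(1.0 - alpha)).clamp(min=1e-5)
    # The smoothed weight is folded into the linear layer; the preceding
    # norm/scale divides activations by s at runtime.
    return weight * s, s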

Hardware requirement: NVIDIA GPU with SM89+ (Compute Capability ≥ 8.9), e.g., RTX 4090, L40, H100, H200. FP8 tensor cores are not available on A100 (SM80) or older.
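A quick way to verify this before launching (a convenience snippet, not part of the official tooling):

import torch

major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (8, 9), f"FP8 W8A8 needs SM89+, got SM{major}{minor}"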


Quickstart (SGLang)

Load this checkpoint with the customized SGLang shipped in the Fast-dLLM repo:

# Install the customized SGLang (one-time)
git clone https://github.com/NVlabs/Fast-dLLM
cd Fast-dLLM/fast_dvlm/sglang/python
pip install -e .

# Run the chatbot with FP8 quantization
cd ../..
python run_chatbot_sglang.py \
    --algorithm spec \
    --model-path Sensen02/Fast_dVLM_3B_W8A8_FP8 \
    --quantization w8a8_fp8 \
    --prompt "Describe this image." \
    --image path/to/image.jpg

SGLang reads the quantization_config in config.json automatically:

"quantization_config": {
    "quant_method": "w8a8_fp8",
    "is_dynamic": false,
    "ignore": []
}

The ignore list is empty because every linear layer, including the vision encoder, is FP8 quantized in this release.
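To inspect the shipped quantization_config without loading the model, one option is to pull config.json directly (a minimal sketch using huggingface_hub):

import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("Sensen02/Fast_dVLM_3B_W8A8_FP8", "config.json")
with open(cfg_path) as f:
    print(json.load(f)["quantization_config"])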


Benchmark Results

Quality (MMMU-Pro-V)

| Variant | MMMU-Pro-V |
|---------|------------|
| Qwen2.5-VL-3B (AR baseline) | 26.3 |
| Fast-dVLM (MDM, τ=0.9) | 21.4 |
| Fast-dVLM (spec.), BF16 | 24.1 |
| Fast-dVLM (spec.), W8A8 FP8 (this repo) | 23.8 |

Inference Acceleration

| Setting | MMMU-Pro-V | TPS | Speedup |
|---------|------------|-----|---------|
| AR baseline | 26.3 | 56.7 | 1.00× |
| Fast-dVLM (MDM, τ=0.9) | 21.4 | 82.2 | 1.45× |
| + Spec. decoding (linear) | 24.6 | 112.7 | 1.98× |
| + SGLang serving | 24.1 | 319.0 | 5.63× |
| + SmoothQuant-W8A8 (FP8) | 23.8 | 350.3 | 6.18× |

Speedup is the TPS ratio against the AR baseline (e.g., 350.3 / 56.7 ≈ 6.18×).

Citation

If you use Fast-dVLM in your research, please cite:

@misc{wu2026fastdvlmefficientblockdiffusionvlm,
      title={Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM},
      author={Chengyue Wu and Shiyi Lan and Yonggan Fu and Sensen Gao and Jin Wang and Jincheng Yu and Jose M. Alvarez and Pavlo Molchanov and Ping Luo and Song Han and Ligeng Zhu and Enze Xie},
      year={2026},
      eprint={2604.06832},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.06832},
}

License

Released under Apache 2.0, following the base Qwen2.5-VL license.
