Fast-dVLM (3B) β€” W8A8 FP8 Quantized

[Paper] [Project Page] [Code] [BF16 Base]

Introduction

This repository hosts the W8A8 FP8 quantized version of Fast_dVLM_3B, produced via SmoothQuant offline calibration.

Both the language tower and the vision encoder are quantized to FP8 (E4M3) weights with per-channel static scales and per-token dynamic FP8 activations. Combined with SGLang block-diffusion serving, this checkpoint reaches 350 TPS (6.18× over the AR baseline) while staying within 0.3 points of the BF16 result on MMMU-Pro-V (23.8 vs 24.1).
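For intuition, the per-token dynamic activation path can be sketched in a few lines of PyTorch (assuming torch >= 2.1 for torch.float8_e4m3fn). This is an illustrative sketch, not the fused kernel the customized SGLang runtime actually uses:

import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_per_token_fp8(x: torch.Tensor):
    # x: [num_tokens, hidden] BF16/FP32 activations.
    # One scale per token (row), computed dynamically at inference time.
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_E4M3_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale  # dequantize as x_fp8.to(torch.float32) * scale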

Key Highlights

  • 6.18× Speedup over the Qwen2.5-VL-3B AR baseline (350 TPS vs 56.7 TPS).
  • Near-Lossless Quality: 23.8 vs 24.1 on MMMU-Pro-V relative to the BF16 block-diffusion model.
  • Full-Model FP8: Language and vision towers are both quantized; only norms, embeddings, and lm_head remain in BF16 (see the snippet below).
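You can confirm the BF16 carve-out by dumping tensor dtypes from the downloaded checkpoint; a minimal sketch with safetensors (the model.safetensors filename is an assumption about the repo layout, and a sharded checkpoint would need a loop over shards):

from collections import Counter
from safetensors import safe_open

dtypes = Counter()
with safe_open("model.safetensors", framework="pt") as f:
    for name in f.keys():
        dtypes[str(f.get_tensor(name).dtype)] += 1
# Expect float8_e4m3fn for linear weights; bfloat16 for norms,
# embeddings, and lm_head.
print(dtypes)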

Model Overview

| Property | Value |
|----------|-------|
| Type | Block Diffusion Vision-Language Model (FP8 quantized) |
| Base Model | Efficient-Large-Model/Fast_dVLM_3B |
| Quantization | SmoothQuant W8A8 FP8 (E4M3), per-channel static weight / per-token dynamic activation |
| Calibration | 512 samples × 1024 tokens on C4, SmoothQuant α=0.5 |
| Text Layers | 36 |
| Vision Depth | 32 |
| Text Hidden Size | 2048 |
| Attention Heads | 16 (Q), 2 (KV, GQA) |
| Block Diffusion Size | 32 |
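The Calibration row refers to SmoothQuant's offline smoothing step. A hedged sketch of that transform with α=0.5 follows; function and variable names are illustrative, not the repo's actual calibration code:

import torch

def smooth_linear(weight: torch.Tensor, act_amax: torch.Tensor, alpha: float = 0.5):
    # weight: [out_features, in_features]
    # act_amax: [in_features] per-channel max |activation| collected during
    # calibration (here: 512 samples x 1024 tokens on C4).
    w_amax = weight.abs().amax(dim=0)  # per-input-channel weight max
    # s_j = max|X_j|^alpha / max|W_j|^(1 - alpha): migrates quantization
    # difficulty from activations into the weights.
    s = (act_amax.pow(alpha) / w_amax.pow(1.0 - alpha)).clamp(min=1e-5)
    # The smoothed weight is folded into the linear layer; the preceding
    # norm/scale divides activations by s at runtime.
    return weight * s, s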

Hardware requirement: NVIDIA GPU with SM89+ (Compute Capability ≥ 8.9), e.g., RTX 4090, L40, H100, H200. FP8 tensor cores are not available on A100 (SM80) or older.
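A quick way to verify this before launching (a convenience snippet, not part of the official tooling):

import torch

major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (8, 9), f"FP8 W8A8 needs SM89+, got SM{major}{minor}"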


Quickstart (SGLang)

Load this checkpoint with the customized SGLang shipped in the Fast-dLLM repo:

# Install the customized SGLang (one-time)
git clone https://github.com/NVlabs/Fast-dLLM
cd Fast-dLLM/fast_dvlm/sglang/python
pip install -e .

# Run the chatbot with FP8 quantization
cd ../..
python run_chatbot_sglang.py \
    --algorithm spec \
    --model-path Sensen02/Fast_dVLM_3B_W8A8_FP8 \
    --quantization w8a8_fp8 \
    --prompt "Describe this image." \
    --image path/to/image.jpg

SGLang reads the quantization_config in config.json automatically:

"quantization_config": {
    "quant_method": "w8a8_fp8",
    "is_dynamic": false,
    "ignore": []
}

The ignore list is empty because every linear layer, including the vision encoder, is FP8 quantized in this release.
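To inspect the shipped quantization_config without loading the model, one option is to pull config.json directly (a minimal sketch using huggingface_hub):

import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("Sensen02/Fast_dVLM_3B_W8A8_FP8", "config.json")
with open(cfg_path) as f:
    print(json.load(f)["quantization_config"])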


Benchmark Results

Quality (MMMU-Pro-V)

| Variant | MMMU-Pro-V |
|---------|------------|
| Qwen2.5-VL-3B (AR baseline) | 26.3 |
| Fast-dVLM (MDM, τ=0.9) | 21.4 |
| Fast-dVLM (spec.), BF16 | 24.1 |
| Fast-dVLM (spec.), W8A8 FP8 (this repo) | 23.8 |

Inference Acceleration

| Setting | MMMU-Pro-V | TPS | Speedup |
|---------|------------|-----|---------|
| AR baseline | 26.3 | 56.7 | 1.00× |
| Fast-dVLM (MDM, τ=0.9) | 21.4 | 82.2 | 1.45× |
| + Spec. decoding (linear) | 24.6 | 112.7 | 1.98× |
| + SGLang serving | 24.1 | 319.0 | 5.63× |
| + SmoothQuant-W8A8 (FP8) | 23.8 | 350.3 | 6.18× |

Speedup is the TPS ratio against the AR baseline (e.g., 350.3 / 56.7 ≈ 6.18×).

Citation

If you use Fast-dVLM in your research, please cite:

@misc{wu2026fastdvlmefficientblockdiffusionvlm,
      title={Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM},
      author={Chengyue Wu and Shiyi Lan and Yonggan Fu and Sensen Gao and Jin Wang and Jincheng Yu and Jose M. Alvarez and Pavlo Molchanov and Ping Luo and Song Han and Ligeng Zhu and Enze Xie},
      year={2026},
      eprint={2604.06832},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.06832},
}

License

Released under Apache 2.0, following the base Qwen2.5-VL license.
