Qwen3-30B-A3B-NVFP4 (Blackwell Optimized - WIP)

This repository contains a self-quantized version of Qwen/Qwen3-30B-A3B in the NVIDIA NVFP4 format, produced on an Asus Ascent GX10 (NVIDIA GB10 Grace Blackwell) system using the NVIDIA TensorRT Model Optimizer (ModelOpt) quantization playbook.

๐Ÿ— Hardware & Architecture

  • Host System: Asus Ascent GX10 (Desktop AI Supercomputer)
  • Accelerator: NVIDIA Blackwell (SM121 / GB10)
  • Memory: 128GB Coherent Unified Memory (LPDDR5X)
  • Format: NVFP4 (4-bit Floating Point) with two-level micro-block scaling.
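To make the "two-level micro-block scaling" concrete, here is a minimal, illustrative Python sketch of the idea (not the actual ModelOpt kernels): each 16-value block gets its own scale, the whole tensor shares a global scale, and values round to the 4-bit E2M1 grid. For simplicity the block scale is kept as a plain float here; in real NVFP4 it is itself stored in FP8 (E4M3).

```python
# Illustrative sketch of NVFP4 two-level scaling; NOT the ModelOpt implementation.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable |values| in E2M1

def quantize_block(block, tensor_scale):
    """Quantize one 16-value block: pick a per-block scale so the largest
    magnitude maps to 6.0 (the E2M1 max), then round each value to the grid."""
    amax = max(abs(v) for v in block)
    block_scale = amax / (6.0 * tensor_scale) if amax else 1.0
    q = []
    for v in block:
        target = abs(v) / (block_scale * tensor_scale)
        code = min(E2M1_GRID, key=lambda g: abs(g - target))  # nearest grid point
        q.append(code if v >= 0 else -code)
    return q, block_scale

def dequantize_block(q, block_scale, tensor_scale):
    """Reverse the two-level scaling: code * block_scale * tensor_scale."""
    return [c * block_scale * tensor_scale for c in q]

vals = [0.1, -0.4, 0.9, 1.2, -2.5, 0.05, 3.3, -0.7,
        0.0, 0.6, -1.1, 2.0, -3.8, 0.3, 1.7, -0.2]
q, s = quantize_block(vals, tensor_scale=1.0)
deq = dequantize_block(q, s, 1.0)
```

Note that the block's largest value (-3.8 here) round-trips exactly, since the block scale is chosen around it; smaller values absorb the rounding error.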

⚡ Current Performance Status (Jan 2026)

Work in progress: initial testing on vLLM is ongoing. Current throughput and latency figures do not yet meet expectations relative to BF16 baselines; kernel optimization is in progress.


🚀 Deployment Details

Quantized using the standard NVIDIA NVFP4 playbook. The format targets a ~3.5x memory reduction versus BF16 and uses hardware-accelerated 5th-generation Tensor Cores on Blackwell silicon.
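The 3.5x figure follows directly from the bit budget: 4 bits per value plus one 8-bit block scale amortized over 16 values, versus 16 bits for BF16. A quick back-of-envelope check (assuming the full ~30.5B parameters are quantized; in practice a few layers are kept in higher precision, which is why the real footprint lands slightly above this):

```python
# Back-of-envelope check of the ~3.5x memory-reduction claim (assumed:
# 30.5B params, 4-bit values + one FP8 scale shared per 16-value block).
params = 30.5e9
bf16_bytes = params * 2                     # BF16 = 2 bytes/param
nvfp4_bits_per_param = 4 + 8 / 16           # value bits + amortized block scale
nvfp4_bytes = params * nvfp4_bits_per_param / 8
print(f"BF16:  {bf16_bytes / 1e9:.1f} GB")
print(f"NVFP4: {nvfp4_bytes / 1e9:.1f} GB")
print(f"ratio: {bf16_bytes / nvfp4_bytes:.2f}x")
```

This yields roughly 17.2 GB of weights and a 3.56x reduction, consistent with the ~17.7 GB footprint reported below.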

  • Quantization Method: ModelOpt (NVFP4)
  • Block Size: 16 (Fine-grained scaling)
  • Supported Frameworks: vLLM, TensorRT-LLM (Blackwell-compatible versions)
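For vLLM, a launch command along these lines should work (a sketch, assuming a Blackwell-capable vLLM build; recent vLLM versions normally auto-detect ModelOpt NVFP4 quantization from the checkpoint config, and the context length shown is illustrative):

```shell
# Illustrative vLLM launch; quantization is auto-detected from the checkpoint.
vllm serve vipertsniper/Qwen3-30B-A3B-NVFP4 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768
```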

🎮 RTX 5090 Compatibility

Initial verification shows compatibility with the RTX 5090 (MSI Ventus 3X OC). The ~17.7 GB weight footprint leaves a 32 GB VRAM card enough headroom for a large KV cache and a long context window (tested with an FP8 KV cache).
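A rough estimate of that KV-cache headroom, assuming the published Qwen3-30B-A3B attention shape (48 layers, 4 KV heads via GQA, head_dim 128 — these are assumptions, check the model's config.json) and 1 byte per element for FP8:

```python
# Rough KV-cache budget on a 32 GB card after ~17.7 GB of NVFP4 weights.
# Assumed model shape: 48 layers, 4 KV heads (GQA), head_dim 128; FP8 = 1 B/elem.
layers, kv_heads, head_dim, bytes_per_elem = 48, 4, 128, 1
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
free_gb = 32 - 17.7                        # VRAM left for cache (ignores overheads)
tokens = free_gb * 1e9 / kv_bytes_per_token
print(f"{kv_bytes_per_token} B/token -> ~{tokens / 1e3:.0f}k tokens of KV cache")
```

That works out to 48 KiB per token and roughly 290k cacheable tokens before activation and framework overheads, which is why long contexts remain practical on a 32 GB card.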

📜 License

  • Original Weights: Qwen/Qwen3-30B-A3B is licensed under Apache 2.0.
  • Quantization: This derivative work follows the same Apache 2.0 permissive terms.