# Qwen3-30B-A3B-NVFP4 (Blackwell Optimized - WIP)
This repository contains a self-quantized version of Qwen/Qwen3-30B-A3B in the NVIDIA NVFP4 format. It was produced on an Asus Ascent GX10 (NVIDIA GB10 Grace Blackwell) system following the NVIDIA TensorRT Model Optimizer (ModelOpt) playbook.
## Hardware & Architecture
- Host System: Asus Ascent GX10 (Desktop AI Supercomputer)
- Accelerator: NVIDIA Blackwell (SM121 / GB10)
- Memory: 128GB Coherent Unified Memory (LPDDR5X)
- Format: NVFP4 (4-bit Floating Point) with two-level micro-block scaling.
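The two-level micro-block scaling above can be illustrated numerically. This is a minimal sketch, not NVIDIA's kernel arithmetic: it assumes the FP4 E2M1 value grid and one shared scale per 16-element block, and simulates quantize/dequantize in plain NumPy.

```python
import numpy as np

# Representable magnitudes of FP4 E2M1 (the value format used by NVFP4).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block):
    """Quantize one 16-element block: pick a scale so the block's max |x|
    maps to 6.0 (the FP4 maximum), snap to the grid, then dequantize."""
    amax = np.abs(block).max()
    scale = amax / 6.0 if amax > 0 else 1.0
    scaled = block / scale
    # Snap each scaled value to the nearest signed FP4 grid point.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q * scale  # dequantized approximation of the original block

x = np.random.default_rng(0).normal(size=64).astype(np.float32)
blocks = x.reshape(-1, 16)  # block size 16, as in this checkpoint
deq = np.concatenate([quantize_block(b) for b in blocks])
print(np.abs(deq - x).max())  # per-block scaling keeps the error small
```

In the real format the per-block scale is itself stored in FP8 with a second, per-tensor FP32 scale on top (hence "two-level"); the sketch keeps the scale in full precision for clarity.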
## Current Performance Status (January 2026)
Work in progress: initial testing on vLLM is ongoing. Throughput and latency do not yet meet expectations relative to BF16 baselines; kernel optimization for this target is in progress.
## Deployment Details
Quantized using the standard NVIDIA NVFP4 playbook. The format targets roughly a 3.5x memory reduction compared to BF16 and is accelerated in hardware by the 5th-generation Tensor Cores on Blackwell silicon.
- Quantization Method: ModelOpt (NVFP4)
- Block Size: 16 (Fine-grained scaling)
- Supported Frameworks: vLLM, TensorRT-LLM (Blackwell-compatible versions)
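The ~3.5x figure follows from the storage layout. A back-of-the-envelope check, assuming the layout sketched above (4-bit values plus one 8-bit scale per 16-weight block, ignoring the small per-tensor scale):

```python
# Effective bits per weight under NVFP4 block scaling (assumed layout:
# 4-bit value + one 8-bit block scale shared across each 16-weight block).
BLOCK_SIZE = 16
bits_per_weight = 4 + 8 / BLOCK_SIZE      # 4.5 bits
reduction_vs_bf16 = 16 / bits_per_weight  # BF16 stores 16 bits per weight
print(round(reduction_vs_bf16, 2))        # ~3.56x, matching the ~3.5x claim
```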
## RTX 5090 Compatibility
Initial verification shows compatibility with the RTX 5090 (MSI Ventus 3X OC). The ~17.7GB weight footprint leaves a 32GB VRAM card with ample room for a large KV cache and a long context window (tested with an FP8 KV cache).
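A rough VRAM budget makes the headroom concrete. The layer and head counts below are assumptions taken from the published Qwen3-30B-A3B configuration, and the estimate ignores activations, CUDA graphs, and framework overhead:

```python
# Rough VRAM budget on a 32 GB card with an FP8 KV cache (illustrative;
# layers/kv_heads/head_dim are assumed from the Qwen3-30B-A3B config).
weights_gb = 17.7                # observed NVFP4 checkpoint footprint
headroom_gb = 32.0 - weights_gb  # ~14.3 GB left over

layers, kv_heads, head_dim = 48, 4, 128
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 1  # K+V, 1 byte each (FP8)
tokens = headroom_gb * 1024**3 / kv_bytes_per_token
print(kv_bytes_per_token, int(tokens))  # ~48 KB/token -> ~300k cached tokens
```

Even with generous allowances for activations and runtime overhead, this leaves far more KV-cache capacity than the model's native context window requires.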
## License
- Original Weights: Qwen/Qwen3-30B-A3B is licensed under Apache 2.0.
- Quantization: this derivative work is distributed under the same permissive Apache 2.0 terms.