# Qwen3-30B-A3B-NVFP4 (Blackwell Optimized - WIP)
This repository contains a self-quantized version of Qwen/Qwen3-30B-A3B in the NVIDIA NVFP4 format. It was produced on an Asus Ascent GX10 (NVIDIA GB10 Grace Blackwell) system following the NVIDIA TensorRT Model Optimizer (ModelOpt) playbook.
## Hardware & Architecture
- Host System: Asus Ascent GX10 (Desktop AI Supercomputer)
- Accelerator: NVIDIA Blackwell (SM121 / GB10)
- Memory: 128GB Coherent Unified Memory (LPDDR5X)
- Format: NVFP4 (4-bit Floating Point) with two-level micro-block scaling.
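The two-level micro-block scaling above can be illustrated numerically. This is a minimal sketch, not NVIDIA's kernel arithmetic: it assumes the FP4 E2M1 value grid and one shared scale per 16-element block, and simulates quantize/dequantize in plain NumPy.

```python
import numpy as np

# Representable magnitudes of FP4 E2M1 (the value format used by NVFP4).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block):
    """Quantize one 16-element block: pick a scale so the block's max |x|
    maps to 6.0 (the FP4 maximum), snap to the grid, then dequantize."""
    amax = np.abs(block).max()
    scale = amax / 6.0 if amax > 0 else 1.0
    scaled = block / scale
    # Snap each scaled value to the nearest signed FP4 grid point.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q * scale  # dequantized approximation of the original block

x = np.random.default_rng(0).normal(size=64).astype(np.float32)
blocks = x.reshape(-1, 16)  # block size 16, as in this checkpoint
deq = np.concatenate([quantize_block(b) for b in blocks])
print(np.abs(deq - x).max())  # per-block scaling keeps the error small
```

In the real format the per-block scale is itself stored in FP8 with a second, per-tensor FP32 scale on top (hence "two-level"); the sketch keeps the scale in full precision for clarity.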
## Current Performance Status (January 2026)
Work in progress: initial testing on vLLM is ongoing. Throughput and latency do not yet meet expectations relative to BF16 baselines; kernel optimization for this target is in progress.
## Deployment Details
Quantized using the standard NVIDIA NVFP4 playbook. The format targets roughly a 3.5x memory reduction compared to BF16 and is accelerated in hardware by the 5th-generation Tensor Cores on Blackwell silicon.
- Quantization Method: ModelOpt (NVFP4)
- Block Size: 16 (Fine-grained scaling)
- Supported Frameworks: vLLM, TensorRT-LLM (Blackwell-compatible versions)
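The ~3.5x figure follows from the storage layout. A back-of-the-envelope check, assuming the layout sketched above (4-bit values plus one 8-bit scale per 16-weight block, ignoring the small per-tensor scale):

```python
# Effective bits per weight under NVFP4 block scaling (assumed layout:
# 4-bit value + one 8-bit block scale shared across each 16-weight block).
BLOCK_SIZE = 16
bits_per_weight = 4 + 8 / BLOCK_SIZE      # 4.5 bits
reduction_vs_bf16 = 16 / bits_per_weight  # BF16 stores 16 bits per weight
print(round(reduction_vs_bf16, 2))        # ~3.56x, matching the ~3.5x claim
```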
## RTX 5090 Compatibility
Initial verification shows compatibility with the RTX 5090 (MSI Ventus 3X OC). The ~17.7GB weight footprint leaves a 32GB VRAM card with ample room for a large KV cache and a long context window (tested with an FP8 KV cache).
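A rough VRAM budget makes the headroom concrete. The layer and head counts below are assumptions taken from the published Qwen3-30B-A3B configuration, and the estimate ignores activations, CUDA graphs, and framework overhead:

```python
# Rough VRAM budget on a 32 GB card with an FP8 KV cache (illustrative;
# layers/kv_heads/head_dim are assumed from the Qwen3-30B-A3B config).
weights_gb = 17.7                # observed NVFP4 checkpoint footprint
headroom_gb = 32.0 - weights_gb  # ~14.3 GB left over

layers, kv_heads, head_dim = 48, 4, 128
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 1  # K+V, 1 byte each (FP8)
tokens = headroom_gb * 1024**3 / kv_bytes_per_token
print(kv_bytes_per_token, int(tokens))  # ~48 KB/token -> ~300k cached tokens
```

Even with generous allowances for activations and runtime overhead, this leaves far more KV-cache capacity than the model's native context window requires.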
## License
- Original Weights: Qwen/Qwen3-30B-A3B is licensed under Apache 2.0.
- Quantization: this derivative work is distributed under the same permissive Apache 2.0 terms.