Gemma-4-12B-IT NVFP4 (ModelOpt)

NVFP4 post-training quantization of google/gemma-4-12B-it, produced with NVIDIA TensorRT Model Optimizer (nvidia-modelopt 0.44.0).

NVIDIA published official NVFP4 checkpoints for Gemma-4-31B and the 26B-A4B MoE, but not for the dense 12B. This checkpoint fills that gap, following the exact recipe NVIDIA used for nvidia/Gemma-4-31B-IT-NVFP4.

Recipe

  • Format: NVFP4 (E2M1 weights + activations, FP8-E4M3 block scale over 16-element groups), quant_algo: NVFP4.
  • Scope: the language-model MLP projections only (gate_proj / up_proj / down_proj across all 48 decoder layers = 144 linears). Attention projections, the vision and audio towers, token embeddings and lm_head are kept in BF16 — identical to NVIDIA's ignore list for the 31B.
  • Calibration: 512 samples of cnn_dailymail (3.0.0), algorithm: max.
  • Hardware: a single NVIDIA RTX 5090 (Blackwell, sm_120). PTQ of the 12B fits on one 32 GB GPU, so no tensor parallelism is needed.
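
To make the format in the recipe concrete, here is a minimal NumPy sketch of NVFP4-style block quantization: each group of 16 values shares one scale, and each scaled value is rounded to the nearest E2M1 magnitude. It is an illustration only, not ModelOpt's implementation. The function name nvfp4_fake_quant is hypothetical, and the sketch keeps the block scale in float64 rather than rounding it to FP8-E4M3 as the real format does.

```python
import numpy as np

# Magnitudes representable in FP4 E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # number of elements that share one scale


def nvfp4_fake_quant(x):
    """Quantize then dequantize a 1-D array whose length is a multiple of 16.

    Simplification (assumption of this sketch): the per-block scale stays in
    float64 instead of being stored as FP8-E4M3.
    """
    x = np.asarray(x, dtype=np.float64)
    blocks = x.reshape(-1, BLOCK)
    # Map each block's largest magnitude onto the top of the grid (6.0).
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / E2M1_GRID[-1], 1.0)
    scaled = blocks / scale
    # Round each magnitude to the nearest grid point, then restore the sign.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return (q * scale).reshape(x.shape)


if __name__ == "__main__":
    w = np.linspace(-3.0, 3.0, 32)
    print(nvfp4_fake_quant(w))
```

Because the scale is chosen per 16-element block, one outlier only coarsens the 15 values that share its block rather than the whole tensor.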

The output is a standard ModelOpt compressed-tensors-style checkpoint (quant_method: modelopt) consumable by vLLM / SGLang / TensorRT-LLM, and by llama.cpp's convert_hf_to_gguf.py.
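
Before pointing a runtime at a downloaded copy, it can help to confirm that the quantization metadata says what this card says. The sketch below searches a parsed JSON config for the two fields named above (quant_method and quant_algo). Where those fields sit in the file is an assumption, so the helper looks them up at any nesting depth; the function names and the sample dictionary are hypothetical.

```python
def find_key(obj, key):
    """Return the first value stored under `key` anywhere in nested dicts/lists, else None."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = list(obj.values())
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def looks_like_modelopt_nvfp4(config):
    """True if the config declares quant_method 'modelopt' and quant_algo 'NVFP4'."""
    return (
        find_key(config, "quant_method") == "modelopt"
        and find_key(config, "quant_algo") == "NVFP4"
    )


if __name__ == "__main__":
    # Hypothetical layout for illustration; the real file may nest these differently.
    sample = {"quantization_config": {"quant_method": "modelopt", "quant_algo": "NVFP4"}}
    print(looks_like_modelopt_nvfp4(sample))
```

In practice the dictionary would come from json.load on the checkpoint's config file rather than from a literal.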

GGUF builds

Ready-to-run llama.cpp GGUFs (NVFP4 FFN + K-quant elsewhere) are at LibertAIDAI/Gemma-4-12B-IT-NVFP4-GGUF.

About LibertAI

LibertAI is a decentralized AI platform — private inference, an OpenAI-compatible API, and a chat UI, all running on community GPUs over Aleph Cloud. If you want to run agents without managing infrastructure, see LiberClaw.

License

Inherits the Gemma license from the upstream model.
