DeepSeek-V2-Lite-NVFP4

NVFP4 (W4A4) quantized version of deepseek-ai/DeepSeek-V2-Lite, produced with llm-compressor.

Model Details

  • Base model: deepseek-ai/DeepSeek-V2-Lite (15.7B params)
  • Architecture: DeepseekV2ForCausalLM (MLA attention + MoE)
  • Quantization: NVFP4 — 4-bit floating-point weights and activations
  • Format: compressed-tensors (nvfp4-pack-quantized)
  • Size: ~8.9 GB (~3.5x smaller than BF16)
  • Group size: 16
  • Scale dtype: float8_e4m3fn
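The size and compression figures above can be sanity-checked with quick arithmetic (a sketch assuming 4-bit values plus one amortized 8-bit group scale; real checkpoints also carry global scales and a few unquantized tensors, which this ignores):

```python
# Back-of-the-envelope checkpoint size: 4-bit weights plus one FP8 scale
# shared by every group of 16 weights (0.5 extra bits per weight).
params = 15.7e9                  # DeepSeek-V2-Lite parameter count
bits_per_weight = 4 + 8 / 16     # 4-bit value + amortized FP8 group scale

fp4_gb = params * bits_per_weight / 8 / 1e9   # ≈ 8.8 GB
bf16_gb = params * 16 / 8 / 1e9               # ≈ 31.4 GB
ratio = bf16_gb / fp4_gb                      # ≈ 3.6x

print(f"{fp4_gb:.1f} GB, {ratio:.1f}x smaller than BF16")
```

The result lines up with the ~8.9 GB / ~3.5x figures once per-tensor global scales and the unquantized lm_head are added back.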

Quantization Details

  • Method: Post-training quantization (PTQ) via llm-compressor oneshot
  • Scheme: NVFP4 — weights and input activations quantized to 4-bit float
  • Calibration: 20 samples from HuggingFaceFW/fineweb-edu (sample-10BT)
  • Ignored layers: lm_head (kept in original precision)
  • Scales: per-tensor global scale (FP32) + per-group local scale (FP8, group size 16)
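The two-level scaling scheme above can be sketched in plain Python (an illustrative model of NVFP4-style quantization, not the llm-compressor kernels; the local scale is kept as a regular float here rather than float8_e4m3fn):

```python
# Illustrative two-level NVFP4-style quantization: each group of 16 values
# gets a local scale, applied on top of a global scale; scaled values are
# rounded to the signed FP4 (E2M1) grid. Real checkpoints store the local
# scale in float8_e4m3fn; this sketch keeps it as a plain Python float.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive E2M1 magnitudes

def quantize_group(group, global_scale=1.0):
    """Quantize one group (e.g. 16 floats) to signed E2M1 plus a local scale."""
    amax = max(abs(x) for x in group)
    # Choose the local scale so the group's max magnitude maps onto 6.0,
    # the largest representable E2M1 value.
    local_scale = (amax / 6.0) / global_scale if amax > 0 else 1.0
    quantized = []
    for x in group:
        scaled = abs(x) / (local_scale * global_scale)
        mag = min(E2M1_GRID, key=lambda g: abs(g - scaled))  # nearest grid point
        quantized.append(mag if x >= 0 else -mag)
    return quantized, local_scale

def dequantize_group(quantized, local_scale, global_scale=1.0):
    return [q * local_scale * global_scale for q in quantized]
```

A value whose magnitude lands exactly on the grid round-trips losslessly; everything else incurs at most half the local grid spacing in error, which is why the per-group scale (rather than one scale per tensor) matters at 4 bits.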

Usage with vLLM

Requires a GPU with NVFP4 tensor core support (NVIDIA Blackwell, SM100+).

vllm serve carlyou/DeepSeek-V2-Lite-NVFP4 \
    --trust-remote-code \
    --max-model-len 2048

Or from Python with the offline API:

from vllm import LLM, SamplingParams

llm = LLM(
    model="carlyou/DeepSeek-V2-Lite-NVFP4",
    trust_remote_code=True,
    max_model_len=2048,
)

outputs = llm.generate("Hello, world!", SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)

Intended Use

This model is primarily intended for benchmarking and testing NVFP4 quantization support in vLLM, particularly MLA attention + quantization fusion patterns on Blackwell GPUs.

Limitations

  • Requires Blackwell GPU (SM100+) for FP4 tensor core acceleration
  • Quantization may degrade output quality compared to FP8 or BF16 versions
  • Not evaluated on standard benchmarks — use for testing/benchmarking only