# DeepSeek-V2-Lite-NVFP4

NVFP4 (W4A4) quantized version of [deepseek-ai/DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite), quantized using [llm-compressor](https://github.com/vllm-project/llm-compressor).
## Model Details

| Field | Value |
|---|---|
| Base model | deepseek-ai/DeepSeek-V2-Lite (15.7B params) |
| Architecture | `DeepseekV2ForCausalLM` (MLA attention + MoE) |
| Quantization | NVFP4 — 4-bit floating-point weights and activations |
| Format | compressed-tensors (`nvfp4-pack-quantized`) |
| Size | ~8.9 GB (~3.5x compression from BF16) |
| Group size | 16 |
| Scale dtype | `float8_e4m3fn` |
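As a sanity check on the size figure above, a rough back-of-the-envelope estimate (illustrative only — the real checkpoint also stores per-tensor global scales, embeddings, and the unquantized `lm_head`, which account for the small remainder):

```python
# Approximate NVFP4 checkpoint size: 4-bit packed weights plus one
# FP8 (1-byte) local scale per group of 16 elements.
params = 15.7e9          # parameter count of DeepSeek-V2-Lite
group_size = 16          # elements sharing one FP8 local scale

weight_bytes = params * 4 / 8            # 4 bits per weight, packed
scale_bytes = params / group_size * 1    # one 1-byte scale per group
total_gb = (weight_bytes + scale_bytes) / 1e9

bf16_gb = params * 2 / 1e9               # BF16 baseline: 2 bytes per param
print(f"~{total_gb:.1f} GB quantized vs {bf16_gb:.1f} GB BF16 "
      f"({bf16_gb / total_gb:.1f}x compression)")
```

This lands within ~0.1 GB of the reported checkpoint size and matches the stated ~3.5x compression ratio.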
## Quantization Details

- **Method:** post-training quantization (PTQ) via llm-compressor `oneshot`
- **Scheme:** NVFP4 — weights and input activations quantized to 4-bit float
- **Calibration:** 20 samples from HuggingFaceFW/fineweb-edu (sample-10BT)
- **Ignored layers:** `lm_head` (kept in original precision)
- **Scales:** per-tensor global scale (FP32) + per-group local scale (FP8, group size 16)
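To make the scale scheme concrete, here is a minimal pure-Python sketch of per-group FP4 (E2M1) fake-quantization: each group of 16 values shares one scale chosen so the group's max magnitude maps to 6.0, the largest E2M1 value. (Real NVFP4 additionally stores this local scale in FP8 E4M3 under a per-tensor FP32 global scale; that detail is omitted here.)

```python
# The 8 non-negative magnitudes representable in FP4 E2M1.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_group(group):
    """Fake-quantize one group: scale, snap to nearest E2M1 value, rescale."""
    scale = max(abs(x) for x in group) / 6.0 or 1.0  # avoid div-by-zero on all-zero groups

    def snap(x):
        mag = min(E2M1_VALUES, key=lambda v: abs(v - abs(x) / scale))
        return (mag if x >= 0 else -mag) * scale

    return [snap(x) for x in group]

group = [0.9, -0.31, 0.05, 1.2, -0.7, 0.0, 0.44, -1.18,
         0.26, 0.61, -0.09, 0.83, -0.5, 0.17, 1.02, -0.35]
print(quantize_group(group))
```

Because the scale is chosen per group of 16 rather than per tensor, a single outlier only distorts its own group — the main reason small group sizes help at 4-bit precision.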
## Usage with vLLM

Requires a GPU with NVFP4 tensor core support (NVIDIA Blackwell, SM100+).

```bash
vllm serve carlyou/DeepSeek-V2-Lite-NVFP4 \
  --trust-remote-code \
  --max-model-len 2048
```
Or for offline inference:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="carlyou/DeepSeek-V2-Lite-NVFP4",
    trust_remote_code=True,
    max_model_len=2048,
)
outputs = llm.generate("Hello, world!", SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```
## Intended Use
This model is primarily intended for benchmarking and testing NVFP4 quantization support in vLLM, particularly MLA attention + quantization fusion patterns on Blackwell GPUs.
## Limitations
- Requires Blackwell GPU (SM100+) for FP4 tensor core acceleration
- Quantization may degrade output quality compared to FP8 or BF16 versions
- Not evaluated on standard benchmarks — use for testing/benchmarking only