SAM 3.1 NVFP4 — Detector (No Language), Static-6

NVFP4-quantized variant of facebook/sam3.1. The detector backbone's vision trunk is quantized; the language backbone stays in FP32. For now this is storage-only quantization: the NomNomLabel loader dequantizes the weights into the official SAM3 image model before execution. Native packed-FP4 execution remains future work.
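
To make "storage-only quantization" concrete, the sketch below estimates the checkpoint footprint from the parameter counts listed under Quantization Details. It assumes each quantized weight costs a 4-bit code plus its share of one 8-bit scale per 16-weight block, and it ignores file headers, per-tensor scales, and metadata. The function name `footprint_bytes` is hypothetical, and the all-FP32 baseline is a hypothetical comparison point rather than a measured file size.

```python
def footprint_bytes(n_quantized, n_fp32, block_size=16, code_bits=4, scale_bits=8):
    """Rough size in bytes of a mixed NVFP4 / FP32 checkpoint (headers ignored)."""
    # Each quantized weight pays for its 4-bit code plus 1/block_size of a block scale.
    bits_per_weight = code_bits + scale_bits / block_size
    quantized = n_quantized * bits_per_weight / 8
    full_precision = n_fp32 * 4  # FP32 uses 4 bytes per parameter
    return quantized + full_precision


# Parameter counts from the Quantization Details table.
mixed = footprint_bytes(472e6, 402e6)
baseline = footprint_bytes(0, 874e6)  # hypothetical: every parameter kept in FP32
print(f"mixed: {mixed / 1e9:.2f} GB, all-FP32: {baseline / 1e9:.2f} GB")
```

Under these assumptions the estimate comes to roughly 1.87 GB against 3.50 GB. Actual file sizes will differ somewhat.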

Quantization Details

Property	Value
Method	NVFP4 (`custom_nvfp4_e2m1_e4m3_scales`)
Block size	16
Scale rule	`static_6`
Quantized scope	`detector.*` (vision trunk only)
Language backbone	FP32 (kept raw for prompt accuracy)
Total parameters	874 M
Quantized parameters	472 M (54.0%)
FP32-kept parameters	402 M (46.0%)
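
The block-wise scheme in the table can be illustrated with a small pure-Python sketch. It is not the packing code used for this checkpoint. It assumes that `static_6` means each 16-weight block is scaled so its largest magnitude lands on 6.0, the largest FP4 E2M1 value. It also keeps the block scale as an ordinary float instead of rounding it to E4M3, and it does not pack two codes per byte. All function names are hypothetical.

```python
# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # block size from the table above


def quantize_block(block):
    """Return (scale, codes); codes are signed values taken from E2M1_GRID."""
    amax = max(abs(x) for x in block)
    # Assumption: "static_6" maps the block maximum onto 6.0.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        # Nearest grid magnitude; ties go to the smaller value in this sketch.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from a block scale and its codes."""
    return [scale * c for c in codes]


def fake_quantize(weights):
    """Quantize then dequantize a flat list of weights, block by block."""
    out = []
    for i in range(0, len(weights), BLOCK_SIZE):
        scale, codes = quantize_block(weights[i:i + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out


weights = [0.9, -0.2, 0.05, 0.6] + [0.0] * 12
print(fake_quantize(weights)[:4])
```

The dequantize step is the part that corresponds to what the loader does before execution: the model still runs on ordinary floating-point weights.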

Fidelity

Validated against source SAM 3.1 on mukbang / food video frames with Sapiens-2 human-part exclusion masking:

Threshold	Mean foreground IoU	Pixel agreement
conf = 0.35	0.988	0.99897
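
The card does not spell out how the two metrics are computed. A minimal sketch, assuming foreground IoU is intersection over union of the foreground pixels and pixel agreement is the fraction of pixels with the same label, could look like this. Masks are plain nested lists here, and both function names are hypothetical.

```python
def foreground_iou(pred, ref):
    """IoU of the foreground (truthy) pixels of two equally sized 2-D masks."""
    inter = union = 0
    for pred_row, ref_row in zip(pred, ref):
        for p, r in zip(pred_row, ref_row):
            inter += bool(p) and bool(r)
            union += bool(p) or bool(r)
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


def pixel_agreement(pred, ref):
    """Fraction of pixels on which the two masks assign the same label."""
    same = total = 0
    for pred_row, ref_row in zip(pred, ref):
        for p, r in zip(pred_row, ref_row):
            same += bool(p) == bool(r)
            total += 1
    return same / total


reference = [[1, 1, 0], [0, 0, 0]]  # e.g. mask from the source model
quantized = [[1, 0, 0], [0, 0, 0]]  # e.g. mask from the NVFP4 model
print(foreground_iou(quantized, reference), pixel_agreement(quantized, reference))
```

Because background pixels dominate most frames, pixel agreement tends to sit closer to 1.0 than foreground IoU, which matches the pattern in the table.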

Some NVFP4 masks look visually sharper than the source FP32 output, which may reflect a de-noising effect of quantization.

Known upstream loader warning

Loading this model reports four missing `backbone.vision_backbone.convs.3.*` keys. The unquantized facebook/sam3.1 checkpoint reports the same keys when loaded through the current SAM3 image-model builder, so the warning is inherited from the upstream checkpoint and builder combination and is not introduced by the NVFP4 packing. The NomNomLabel loader suppresses this one known upstream notice and still surfaces any other missing-key output.
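
The NomNomLabel loader's own filtering code is not shown here. One way to get the same behaviour is sketched below: take the list of missing keys (for example the `missing_keys` field returned by PyTorch's `load_state_dict(strict=False)`) and drop only those matching the known upstream pattern. The helper name and the example key names are hypothetical.

```python
from fnmatch import fnmatchcase

# The one key pattern known to be missing upstream.
KNOWN_UPSTREAM_MISSING = ("backbone.vision_backbone.convs.3.*",)


def unexpected_missing_keys(missing_keys, known_patterns=KNOWN_UPSTREAM_MISSING):
    """Return the missing keys that are NOT covered by a known upstream pattern."""
    return [
        key
        for key in missing_keys
        if not any(fnmatchcase(key, pattern) for pattern in known_patterns)
    ]


# Hypothetical key names, for illustration only.
reported = [
    "backbone.vision_backbone.convs.3.conv.weight",
    "backbone.vision_backbone.convs.3.conv.bias",
    "some_head.proj.weight",
]
for key in unexpected_missing_keys(reported):
    print(f"missing key: {key}")
```

Matching on an explicit pattern rather than silencing all warnings keeps genuinely new missing keys visible.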

Files

File	Description
`nvfp4_model.safetensors`	NVFP4-packed detector tensors
`sam3.1_multiplex.pt`	FP32 non-quantized tensors (language backbone, heads)
`config.json`	SAM 3.1 model config
`quantization_config.json`	NVFP4 packing metadata
`quant_error_report.json`	Per-tensor L2 error report
`tokenizer*.json / vocab.json / merges.txt`	Language-backbone tokenizer
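
To see what `nvfp4_model.safetensors` contains without loading any weights, the file header can be read with the standard library alone. The sketch below relies on the published safetensors layout: an 8-byte little-endian header length followed by a JSON header. The function names are hypothetical, and the tensor names and dtypes in this particular file are not documented here, so none are assumed.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as an unsigned 64-bit LE integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def list_tensors(path):
    """Map tensor name -> (dtype, shape), skipping the optional metadata entry."""
    header = read_safetensors_header(path)
    return {
        name: (info["dtype"], info["shape"])
        for name, info in header.items()
        if name != "__metadata__"
    }


# Example usage once the file has been downloaded locally:
# for name, (dtype, shape) in list_tensors("nvfp4_model.safetensors").items():
#     print(name, dtype, shape)
```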

Usage

# Load the quantized model using the nomnomlabel loader
from nomnomlabel.quant_loader import load_sam3_nvfp4

model = load_sam3_nvfp4("Reza2kn/sam3.1-nvfp4-detector-no-language")

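# Segment food (with language prompt)
from PIL import Image
from nomnomlabel.sam3_food_classifier import SAM3FoodClassifier

# Assumption: the classifier accepts a PIL image; "frame.jpg" is a placeholder path.
image = Image.open("frame.jpg").convert("RGB")

classifier = SAM3FoodClassifier(model_id="Reza2kn/sam3.1-nvfp4-detector-no-language")
segments = classifier.segment_and_classify_food(image, conf_threshold=0.35)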

for seg in segments:
    print(f"Food: {seg.food_type} (conf={seg.food_conf:.2f})")
    print(f"  Mask area: {seg.area} pixels")
    print(f"  BBox: {seg.bbox}")

Training / Benchmark Context

Built as part of the NomNomLabel Sapiens2 Benchmark for food instance segmentation in long-form mukbang / eating content. The teacher stack uses:

SAM 3.1 NVFP4 (this model) for open-vocabulary food instance segmentation
Sapiens-2 INT4-G128 for human-part segmentation and exclusion masking
Sapiens-2 normals, pointmap/depth proxy, and matting as auxiliary human-scene teachers
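
How the benchmark combines the teacher outputs is not described in this card. As a rough illustration of exclusion masking, the sketch below removes every pixel that a human-part mask marks as human from a food mask. Masks are nested lists, and the function name is hypothetical.

```python
def exclude_human_pixels(food_mask, human_mask):
    """Keep a food pixel only where the human-part mask is empty."""
    return [
        [bool(food) and not bool(human) for food, human in zip(food_row, human_row)]
        for food_row, human_row in zip(food_mask, human_mask)
    ]


food = [[1, 1, 0], [1, 0, 0]]   # e.g. a food instance mask
human = [[0, 1, 0], [0, 0, 1]]  # e.g. a hand overlapping the food
print(exclude_human_pixels(food, human))
```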

Source

Base checkpoint: facebook/sam3.1.

Quantized by Reza2kn using torchao NVFP4.
