SAM 3.1 NVFP4 - Detector (No Language), Static-6
NVFP4 quantized variant of facebook/sam3.1. Quantizes the detector backbone's vision trunk while keeping the language backbone in FP32. This is currently a storage-size quantization: the NomNomLabel loader dequantizes weights into the official SAM3 image model before execution; native packed-FP4 execution remains future work.
Quantization Details
| Property | Value |
|---|---|
| Method | NVFP4 (custom_nvfp4_e2m1_e4m3_scales) |
| Block size | 16 |
| Scale rule | static_6 |
| Quantized scope | detector.* (vision trunk only) |
| Language backbone | FP32 (kept raw for prompt accuracy) |
| Total parameters | 874 M |
| Quantized parameters | 472 M (54.0%) |
| FP32-kept parameters | 402 M (46.0%) |
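To make the table's terms concrete, here is a minimal NumPy sketch of block-wise FP4 (E2M1) quantize-dequantize with a block size of 16. It is not the NomNomLabel or torchao implementation. It assumes that `static_6` means each block's largest magnitude is always mapped to 6.0, the largest E2M1 value; that reading is a guess from the name. The sketch also keeps block scales in float32 rather than rounding them to E4M3, and it does no bit packing. The function name `fake_quant_nvfp4` is hypothetical.

```python
import numpy as np

# Non-negative magnitudes representable in FP4 E2M1 (sign is stored separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
BLOCK = 16  # block size from the table above


def fake_quant_nvfp4(w: np.ndarray) -> np.ndarray:
    """Quantize-dequantize an array whose size is a multiple of BLOCK.

    Hypothetical helper for illustration only.
    """
    blocks = np.asarray(w, dtype=np.float32).reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # ASSUMPTION: "static_6" maps each block's max magnitude to 6.0.
    # All-zero blocks get a dummy scale of 1 to avoid dividing by zero.
    scale = np.where(amax > 0, amax / 6.0, 1.0).astype(np.float32)
    scaled = blocks / scale
    # Round each magnitude to the nearest E2M1 grid point.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    # Dequantize back to real values, as a storage-only loader would.
    return (q * scale).reshape(np.shape(w))
```

With one 8-bit E4M3 scale per 16-element block, the packed format costs about 4 + 8/16 = 4.5 bits per quantized weight, compared with 32 bits in FP32, ignoring any per-tensor metadata.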
Fidelity
Validated against source SAM 3.1 on mukbang / food video frames with Sapiens-2 human-part exclusion masking:
| Threshold | Mean foreground IoU | Pixel agreement |
|---|---|---|
| conf = 0.35 | 0.988 | 0.99897 |
Some NVFP4 masks are visually sharper than the source FP32 output due to quantization-induced de-noising.
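For readers who want to reproduce this kind of comparison, the two metrics can be sketched per frame as below. This is not the benchmark's actual evaluation code. The helper name `mask_fidelity` is hypothetical, and it assumes that exclusion masking means pixels covered by the human-part mask are simply ignored in both metrics.

```python
import numpy as np


def mask_fidelity(ref, quant, exclude=None):
    """Return (foreground IoU, pixel agreement) for two binary masks.

    ref, quant: same-shape arrays, truthy where food is predicted.
    exclude:    optional same-shape array, truthy where pixels are ignored
                (ASSUMPTION: this is how human-part exclusion is applied).
    """
    ref = np.asarray(ref, dtype=bool)
    quant = np.asarray(quant, dtype=bool)
    if exclude is None:
        keep = np.ones_like(ref, dtype=bool)
    else:
        keep = ~np.asarray(exclude, dtype=bool)
    if not keep.any():
        # Nothing left to compare; treat the masks as trivially identical.
        return 1.0, 1.0
    inter = np.logical_and(ref, quant)[keep].sum()
    union = np.logical_or(ref, quant)[keep].sum()
    # Two empty foregrounds agree perfectly by convention.
    iou = 1.0 if union == 0 else inter / union
    agreement = (ref == quant)[keep].mean()
    return float(iou), float(agreement)
```

Averaging the first value over frames gives a mean foreground IoU, and averaging the second gives a pixel-agreement figure comparable in spirit to the table.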
Known upstream loader warning
The unquantized facebook/sam3.1 checkpoint itself reports four missing
backbone.vision_backbone.convs.3.* keys when loaded through the current SAM3
image-model builder. The warning is inherited from the upstream checkpoint /
builder combination and is not introduced by the NVFP4 packing. The NomNomLabel
loader suppresses this exact known upstream notice while still surfacing any
other missing-key output.
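The filtering behaviour described above can be sketched as follows. This illustrates the idea only and is not the NomNomLabel loader's code: the helper `split_missing_keys` is hypothetical, and it assumes the loader has access to a list of missing key names. Only the key pattern comes from this card.

```python
from fnmatch import fnmatchcase

# Pattern for the known upstream missing keys, as named in this card.
KNOWN_UPSTREAM_PATTERN = "backbone.vision_backbone.convs.3.*"


def split_missing_keys(missing_keys):
    """Split missing keys into (known upstream, unexpected).

    Hypothetical helper: known keys can be silenced, while anything in
    the second list should still be reported to the user.
    """
    known, unexpected = [], []
    for key in missing_keys:
        if fnmatchcase(key, KNOWN_UPSTREAM_PATTERN):
            known.append(key)
        else:
            unexpected.append(key)
    return known, unexpected
```

Matching on the key pattern rather than silencing all missing-key output means a genuinely broken checkpoint still produces a visible warning.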
Files
| File | Description |
|---|---|
| nvfp4_model.safetensors | NVFP4-packed detector tensors |
| sam3.1_multiplex.pt | FP32 non-quantized tensors (language backbone, heads) |
| config.json | SAM 3.1 model config |
| quantization_config.json | NVFP4 packing metadata |
| quant_error_report.json | Per-tensor L2 error report |
| tokenizer*.json / vocab.json / merges.txt | Language-backbone tokenizer |
Usage
# Load the quantized model using the nomnomlabel loader
from nomnomlabel.quant_loader import load_sam3_nvfp4
model = load_sam3_nvfp4("Reza2kn/sam3.1-nvfp4-detector-no-language")

# Segment food (with language prompt)
# `image` must already be loaded by the caller; its expected type is not specified here.
from nomnomlabel.sam3_food_classifier import SAM3FoodClassifier
classifier = SAM3FoodClassifier(model_id="Reza2kn/sam3.1-nvfp4-detector-no-language")
segments = classifier.segment_and_classify_food(image, conf_threshold=0.35)

# Print a short summary of each detected food segment
for seg in segments:
    print(f"Food: {seg.food_type} (conf={seg.food_conf:.2f})")
    print(f"  Mask area: {seg.area} pixels")
    print(f"  BBox: {seg.bbox}")
Training / Benchmark Context
Built as part of the NomNomLabel Sapiens2 Benchmark for food instance segmentation in long-form mukbang / eating content. The teacher stack uses:
- SAM 3.1 NVFP4 (this model) for open-vocabulary food instance segmentation
- Sapiens-2 INT4-G128 for human-part segmentation and exclusion masking
- Sapiens-2 normals, pointmap/depth proxy, and matting as auxiliary human-scene teachers
Source
Base checkpoint: facebook/sam3.1.
Quantized by Reza2kn using torchao NVFP4.