SigLIP SO400M Patch14 384

Model

SigLIP SO400M Patch14 384 is a dual-encoder vision-language model for zero-shot image-text similarity.

Vision encoder: 27-layer ViT-SO400M, patch size 14, image size 384×384
Text encoder: 27-layer transformer, vocab 32 000, max sequence length 64
Hidden size: 1152, intermediate size: 4304, attention heads: 16

Reference implementation: reference-llm-models / siglip-so400m

Available weights

| Directory | File | Dtype | Tensors |
| --- | --- | --- | --- |
| `siglip_so400m_patch14_384_fp16/` | `siglip_so400m_fp16.npz` | FP16 | 888 |
| `siglip_so400m_patch14_384_fp16/` | `mmproj-siglip-f16.gguf` | F16 | 892 |
| `siglip_so400m_patch14_384_fp32/` | `siglip_so400m_fp32.npz` | FP32 | 888 |
| `siglip_so400m_patch14_384_fp32/` | `mmproj-siglip-f32.gguf` | F32 | 892 |

Both NPZ files contain the complete model: vision encoder (vision_model.*), text encoder (text_model.*), and contrastive scalars (logit_scale, logit_bias).
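
To check that a downloaded NPZ file has the layout described above, the tensors can be counted by top-level key prefix. This is a minimal sketch that assumes only NumPy; `summarize_npz` is a hypothetical helper, not part of the reference scripts.

```python
from collections import Counter

import numpy as np


def summarize_npz(path):
    """Count the arrays in an NPZ archive, grouped by top-level key prefix."""
    counts = Counter()
    with np.load(path) as archive:
        for key in archive.files:
            # "vision_model.<rest>" -> "vision_model"; "logit_scale" stays as-is.
            counts[key.split(".", 1)[0]] += 1
    return dict(counts)
```

Calling `summarize_npz("./siglip_so400m_patch14_384_fp16/siglip_so400m_fp16.npz")` should then report entries for `vision_model`, `text_model`, `logit_scale`, and `logit_bias`, with counts summing to the tensor count listed in the table.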

How to run

export USE_TORCH=1
python ./scripts/generate.py \
    --weights ./siglip_so400m_patch14_384_fp16/siglip_so400m_fp16.npz \
    --image_path cat.jpg \
    --text "a photo of a cat"

Synthetic image (for validation)

export USE_TORCH=1
python ./scripts/generate.py \
    --weights ./siglip_so400m_patch14_384_fp16/siglip_so400m_fp16.npz \
    --image_seed 42 \
    --text "a photo of a cat"
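
A seeded image is useful for validation because every run, and every backend, sees exactly the same pixels. The sketch below illustrates the idea only: how `generate.py` actually builds its synthetic image is not documented here, so the uniform uint8 distribution and the `synthetic_image` helper are assumptions.

```python
import numpy as np


def synthetic_image(seed, size=384):
    """Return a reproducible random RGB image of shape (size, size, 3), dtype uint8.

    Assumption: pixels are drawn uniformly from 0..255. The reference script
    may use a different distribution or generator.
    """
    rng = np.random.default_rng(seed)
    return rng.integers(0, 256, size=(size, size, 3), dtype=np.uint8)
```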

Dump intermediates for numerical comparison

export USE_TORCH=1
python ./scripts/generate.py \
    --weights ./siglip_so400m_patch14_384_fp16/siglip_so400m_fp16.npz \
    --image_seed 42 \
    --text "a photo of a cat" \
    --return_intermediates --output_npz intermediates.npz
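
Two such dumps (for example, one from an FP16 run and one from an FP32 run) can be compared array by array. This is a minimal sketch under the assumption that both files are plain NPZ archives; it makes no assumption about the key names inside `intermediates.npz`, and `compare_npz` is a hypothetical helper.

```python
import numpy as np


def compare_npz(path_a, path_b):
    """Return {key: max absolute difference} for arrays present in both archives."""
    report = {}
    with np.load(path_a) as a, np.load(path_b) as b:
        for key in sorted(set(a.files) & set(b.files)):
            # Compare in float64 so FP16 and FP32 dumps are on equal footing.
            x = a[key].astype(np.float64)
            y = b[key].astype(np.float64)
            if x.shape != y.shape:
                # Flag a shape mismatch instead of raising.
                report[key] = float("inf")
            elif x.size == 0:
                report[key] = 0.0
            else:
                report[key] = float(np.max(np.abs(x - y)))
    return report
```

Sorting the report by value makes it easy to see which intermediate diverges most between the two runs.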

Key configuration

Config file: model_config.json

| Parameter | Value |
| --- | --- |
| `HIDDEN_SIZE` | 1152 |
| `INTERMEDIATE_SIZE` | 4304 |
| `NUM_HEADS` | 16 |
| `HEAD_DIM` | 72 |
| `NUM_LAYERS` | 27 |
| `PATCH_SIZE` | 14 |
| `IMAGE_SIZE` | 384 |
| `NUM_PATCHES` | 729 |
| `VOCAB_SIZE` | 32000 |
| `MAX_POSITION_EMBEDDINGS` | 64 |

These values are set in model_config.json, which is read by siglip_model_text.py and siglip_model_vision.py.
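
Two of the listed values follow from the others: `HEAD_DIM` is the hidden size split across heads, and `NUM_PATCHES` is the number of whole patches that fit in the image. The sketch below shows the arithmetic, assuming non-overlapping patches with no padding; the dictionary reuses the parameter names from the table and is not necessarily the layout of model_config.json.

```python
def derived_shapes(cfg):
    """Derive head dimension and patch count from the base configuration."""
    assert cfg["HIDDEN_SIZE"] % cfg["NUM_HEADS"] == 0, "heads must divide hidden size"
    patches_per_side = cfg["IMAGE_SIZE"] // cfg["PATCH_SIZE"]
    return {
        "HEAD_DIM": cfg["HIDDEN_SIZE"] // cfg["NUM_HEADS"],
        "PATCHES_PER_SIDE": patches_per_side,
        "NUM_PATCHES": patches_per_side ** 2,
        # Pixels per side left over after the last whole patch.
        "UNUSED_PIXELS": cfg["IMAGE_SIZE"] - patches_per_side * cfg["PATCH_SIZE"],
    }


shapes = derived_shapes(
    {"HIDDEN_SIZE": 1152, "NUM_HEADS": 16, "PATCH_SIZE": 14, "IMAGE_SIZE": 384}
)
print(shapes)
# {'HEAD_DIM': 72, 'PATCHES_PER_SIDE': 27, 'NUM_PATCHES': 729, 'UNUSED_PIXELS': 6}
```

Because 384 is not a multiple of 14, each side holds 27 whole patches (378 pixels), which gives the 729 patches listed above.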

Note: setting USE_TORCH=1 (export USE_TORCH=1) is mandatory. The reference code is built on torch_extend_ops.py, which requires PyTorch with CUDA.

Input / Output

Input

Image: path to a local file (--image_path), a URL (--image_url), or a synthetic random image via seed (--image_seed)
Text: query string (--text), padded / truncated to 64 tokens
Dtype: --dtype fp16 (default) or --dtype fp32
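
The fixed text length means every query ends up as exactly 64 token IDs. This is a minimal sketch of that step, assuming right-side padding and truncation that keeps the leading tokens; `pad_or_truncate` is a hypothetical helper, and the correct `pad_id` depends on the tokenizer, so it is left as a required argument.

```python
def pad_or_truncate(token_ids, pad_id, max_length=64):
    """Return a list of exactly max_length token IDs.

    Assumptions: longer inputs keep their first max_length tokens, shorter
    inputs are padded on the right with pad_id.
    """
    clipped = list(token_ids[:max_length])
    return clipped + [pad_id] * (max_length - len(clipped))
```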

Output

Loading weights from siglip_so400m_fp16.npz …
SigLIP probability: tensor([0.9142])

The output is a sigmoid probability in [0, 1]; a higher value means the image and text are more similar.
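
For orientation, a SigLIP-style score is usually computed from the two embeddings and the contrastive scalars as sketched below. This follows the common SigLIP formulation and assumes `logit_scale` is stored in log space; the reference code may differ in detail, and `siglip_probability` is a hypothetical helper.

```python
import numpy as np


def siglip_probability(image_embed, text_embed, logit_scale, logit_bias):
    """Sigmoid of the scaled, biased cosine similarity of two embeddings.

    Assumption: logit_scale is a log-space scalar, so it is exponentiated.
    """
    image_embed = np.asarray(image_embed, dtype=np.float64)
    text_embed = np.asarray(text_embed, dtype=np.float64)
    # L2-normalise so the dot product is a cosine similarity.
    image_embed = image_embed / np.linalg.norm(image_embed)
    text_embed = text_embed / np.linalg.norm(text_embed)
    logit = np.exp(logit_scale) * float(image_embed @ text_embed) + logit_bias
    return float(1.0 / (1.0 + np.exp(-logit)))
```

With orthogonal embeddings and a zero bias the logit is 0, so the probability is exactly 0.5; the learned bias shifts that baseline.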

