SigLIP SO400M Patch14 384

Model

SigLIP SO400M Patch14 384 is a dual-encoder vision-language model for zero-shot image-text similarity.

  • Vision encoder: 27-layer ViT-SO400M, patch size 14, image size 384×384
  • Text encoder: 27-layer transformer, vocab 32 000, max sequence length 64
  • Hidden size: 1152, intermediate size: 4304, attention heads: 16

Reference implementation: reference-llm-models / siglip-so400m


Available weights

Directory                         File                     Dtype   Tensors
siglip_so400m_patch14_384_fp16/   siglip_so400m_fp16.npz   FP16    888
siglip_so400m_patch14_384_fp16/   mmproj-siglip-f16.gguf   F16     892
siglip_so400m_patch14_384_fp32/   siglip_so400m_fp32.npz   FP32    888
siglip_so400m_patch14_384_fp32/   mmproj-siglip-f32.gguf   F32     892

Both NPZ files contain the complete model: vision encoder (vision_model.*), text encoder (text_model.*), and contrastive scalars (logit_scale, logit_bias).
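To check what an NPZ file contains, the tensor names can be grouped by their top-level prefix. The following is a minimal sketch using NumPy; summarize_npz is a hypothetical helper, and the stand-in archive uses made-up tensor names under the vision_model and text_model prefixes so the example runs without the real weights.

```python
import io
from collections import Counter

import numpy as np


def summarize_npz(npz_file):
    """Count tensors per top-level prefix (the text before the first dot)."""
    counts = Counter()
    with np.load(npz_file) as archive:
        for name in archive.files:
            counts[name.split(".", 1)[0]] += 1
    return dict(counts)


# Tiny in-memory stand-in archive; the two "*.example_weight" names are
# invented for illustration and do not come from the real checkpoint.
buffer = io.BytesIO()
np.savez(
    buffer,
    **{
        "vision_model.example_weight": np.zeros((2, 2), dtype=np.float16),
        "text_model.example_weight": np.zeros((2, 2), dtype=np.float16),
        "logit_scale": np.zeros(1, dtype=np.float16),
        "logit_bias": np.zeros(1, dtype=np.float16),
    },
)
buffer.seek(0)

summary = summarize_npz(buffer)
print(summary)
```

Pointing summarize_npz at siglip_so400m_fp16.npz instead of the buffer should list the same four prefixes, with the per-prefix counts adding up to the tensor total in the table above.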


How to run

export USE_TORCH=1
python ./scripts/generate.py \
    --weights ./siglip_so400m_patch14_384_fp16/siglip_so400m_fp16.npz \
    --image_path cat.jpg \
    --text "a photo of a cat"

Synthetic image (for validation)

export USE_TORCH=1
python ./scripts/generate.py \
    --weights ./siglip_so400m_patch14_384_fp16/siglip_so400m_fp16.npz \
    --image_seed 42 \
    --text "a photo of a cat"

Dump intermediates for numerical comparison

export USE_TORCH=1
python ./scripts/generate.py \
    --weights ./siglip_so400m_patch14_384_fp16/siglip_so400m_fp16.npz \
    --image_seed 42 \
    --text "a photo of a cat" \
    --return_intermediates --output_npz intermediates.npz
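Once intermediates are dumped, a numerical comparison against another run or implementation can be as simple as a per-tensor maximum absolute difference. This is a sketch under assumptions: the layout of intermediates.npz is not documented here, so compare_intermediates is a hypothetical helper that works on any name-to-array mapping, and the keys layer_0 and layer_1 are placeholders.

```python
import numpy as np


def compare_intermediates(reference, candidate):
    """Return {name: max absolute difference} for arrays present in both mappings."""
    report = {}
    for name in sorted(set(reference) & set(candidate)):
        ref = np.asarray(reference[name], dtype=np.float64)
        cand = np.asarray(candidate[name], dtype=np.float64)
        if ref.shape != cand.shape:
            # Flag shape mismatches instead of silently broadcasting.
            report[name] = float("inf")
        elif ref.size == 0:
            report[name] = 0.0
        else:
            report[name] = float(np.max(np.abs(ref - cand)))
    return report


# Stand-in data with placeholder keys. With real dumps, pass
# dict(np.load("intermediates.npz")) and a second archive loaded the same way.
reference = {"layer_0": np.array([1.0, 2.0]), "layer_1": np.array([0.5])}
candidate = {"layer_0": np.array([1.0, 2.25]), "layer_1": np.array([0.5])}

report = compare_intermediates(reference, candidate)
for name, diff in report.items():
    print(f"{name}: max |diff| = {diff}")
```

Comparing in float64 keeps the difference itself from being rounded when both dumps are FP16.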

Key configuration

Config file: model_config.json

Parameter                 Value
HIDDEN_SIZE               1152
INTERMEDIATE_SIZE         4304
NUM_HEADS                 16
HEAD_DIM                  72
NUM_LAYERS                27
PATCH_SIZE                14
IMAGE_SIZE                384
NUM_PATCHES               729
VOCAB_SIZE                32000
MAX_POSITION_EMBEDDINGS   64

These values are set in model_config.json, which is read by siglip_model_text.py and siglip_model_vision.py.
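Two of the parameters above follow from the others: HEAD_DIM is HIDDEN_SIZE divided by NUM_HEADS, and NUM_PATCHES is the square of the number of whole patches that fit along one image side. The sketch below checks that arithmetic. It uses an inline dictionary with the listed values, because the actual key layout of model_config.json is not shown here and is assumed.

```python
# Values copied from the parameter list above. The flat-dictionary layout is
# an assumption; the real model_config.json may be structured differently.
config = {
    "HIDDEN_SIZE": 1152,
    "NUM_HEADS": 16,
    "HEAD_DIM": 72,
    "PATCH_SIZE": 14,
    "IMAGE_SIZE": 384,
    "NUM_PATCHES": 729,
}

# The hidden size must split evenly across attention heads.
assert config["HIDDEN_SIZE"] % config["NUM_HEADS"] == 0
head_dim = config["HIDDEN_SIZE"] // config["NUM_HEADS"]

# 384 is not a multiple of 14, so only whole patches are counted per side
# (27 patches cover 378 pixels).
patches_per_side = config["IMAGE_SIZE"] // config["PATCH_SIZE"]
num_patches = patches_per_side ** 2

print(head_dim, patches_per_side, num_patches)  # 72 27 729
```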

Note: export USE_TORCH=1 is mandatory. The reference code is based on torch_extend_ops.py, which requires PyTorch with CUDA.


Input / Output

Input

  • Image: path to a local file (--image_path), a URL (--image_url), or a synthetic random image via seed (--image_seed)
  • Text: query string (--text), padded / truncated to 64 tokens
  • Dtype: --dtype fp16 (default) or --dtype fp32
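The text input is padded or truncated to 64 tokens, as noted in the list above. A minimal sketch of that step follows; pad_or_truncate is a hypothetical helper, the token ids are made up, and pad_id=0 is a placeholder, since the tokenizer's real padding id is not stated here.

```python
def pad_or_truncate(token_ids, max_length=64, pad_id=0):
    """Clip to max_length, then right-pad with pad_id up to max_length.

    pad_id=0 is a placeholder; use the tokenizer's actual padding id.
    """
    clipped = list(token_ids[:max_length])
    return clipped + [pad_id] * (max_length - len(clipped))


short_query = [101, 102, 103]      # made-up token ids
long_query = list(range(1, 101))   # 100 made-up token ids

padded = pad_or_truncate(short_query)
truncated = pad_or_truncate(long_query)

print(len(padded), len(truncated))  # 64 64
```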

Output

Loading weights from siglip_so400m_fp16.npz …
SigLIP probability: tensor([0.9142])

The output is a sigmoid probability in [0, 1]; higher values mean the image and text match more closely.
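For reference, here is a sketch of how such a probability is typically derived from the two embeddings and the contrastive scalars logit_scale and logit_bias. It assumes the usual SigLIP formulation (L2-normalized embeddings, logit_scale stored in log space); the reference scripts are not shown here and may differ in detail. siglip_probability is a hypothetical helper and the vectors are toy data.

```python
import numpy as np


def siglip_probability(image_embedding, text_embedding, logit_scale, logit_bias):
    """Sigmoid of the scaled, shifted cosine similarity between two embeddings.

    Assumes logit_scale is stored in log space, as in the usual SigLIP setup.
    """
    image = image_embedding / np.linalg.norm(image_embedding)
    text = text_embedding / np.linalg.norm(text_embedding)
    logit = np.exp(logit_scale) * float(image @ text) + logit_bias
    return float(1.0 / (1.0 + np.exp(-logit)))


# Toy vectors: orthogonal embeddings have cosine similarity 0, so with a zero
# bias the logit is 0 and the probability is exactly 0.5.
orthogonal = siglip_probability(
    np.array([1.0, 0.0]), np.array([0.0, 1.0]), logit_scale=0.0, logit_bias=0.0
)
aligned = siglip_probability(
    np.array([2.0, 0.0]), np.array([5.0, 0.0]), logit_scale=0.0, logit_bias=0.0
)
print(orthogonal, aligned)
```

Unlike a softmax over several captions, the sigmoid scores each image-text pair independently, so probabilities for different texts need not sum to 1.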
