SigLIP SO400M Patch14 384
Model
SigLIP SO400M Patch14 384 is a dual-encoder vision-language model for zero-shot image-text similarity.
- Vision encoder: 27-layer ViT-SO400M, patch size 14, image size 384×384
- Text encoder: 27-layer transformer, vocab 32 000, max sequence length 64
- Hidden size: 1152, intermediate size: 4304, attention heads: 16
Reference implementation: reference-llm-models / siglip-so400m
Available weights
| Directory | File | Dtype | Tensors |
|---|---|---|---|
| siglip_so400m_patch14_384_fp16/ | siglip_so400m_fp16.npz | FP16 | 888 |
| siglip_so400m_patch14_384_fp16/ | mmproj-siglip-f16.gguf | F16 | 892 |
| siglip_so400m_patch14_384_fp32/ | siglip_so400m_fp32.npz | FP32 | 888 |
| siglip_so400m_patch14_384_fp32/ | mmproj-siglip-f32.gguf | F32 | 892 |
Both NPZ files contain the complete model: vision encoder (vision_model.*),
text encoder (text_model.*), and contrastive scalars (logit_scale, logit_bias).
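To check an archive against the description above, a minimal sketch that groups tensor names by their top-level prefix may help. The helper name summarize_npz is hypothetical and not part of the repository; it only assumes that the file is a standard NumPy .npz archive whose keys are dot-separated.

```python
from collections import Counter

import numpy as np


def summarize_npz(source):
    """Count arrays per top-level prefix (the part of the key before the first '.').

    `source` may be a path or a seekable file-like object, as accepted by np.load.
    Returns (total_number_of_arrays, {prefix: count}).
    """
    counts = Counter()
    with np.load(source) as archive:
        for name in archive.files:
            counts[name.split(".", 1)[0]] += 1
    return sum(counts.values()), dict(counts)


if __name__ == "__main__":
    # Path taken from the table above; adjust to your local checkout.
    total, per_prefix = summarize_npz(
        "./siglip_so400m_patch14_384_fp16/siglip_so400m_fp16.npz"
    )
    print(total, per_prefix)
```

The total can be compared with the Tensors column of the table; keys without a dot, such as logit_scale and logit_bias, are counted under their own name.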
How to run
export USE_TORCH=1
python ./scripts/generate.py \
--weights ./siglip_so400m_patch14_384_fp16/siglip_so400m_fp16.npz \
--image_path cat.jpg \
--text "a photo of a cat"
Synthetic image (for validation)
export USE_TORCH=1
python ./scripts/generate.py \
--weights ./siglip_so400m_patch14_384_fp16/siglip_so400m_fp16.npz \
--image_seed 42 \
--text "a photo of a cat"
Dump intermediates for numerical comparison
export USE_TORCH=1
python ./scripts/generate.py \
--weights ./siglip_so400m_patch14_384_fp16/siglip_so400m_fp16.npz \
--image_seed 42 \
--text "a photo of a cat" \
--return_intermediates --output_npz intermediates.npz
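Once intermediates.npz exists, it can be compared against a second dump, for example one produced with the FP32 weights or by another runtime. The sketch below is one possible approach rather than a script shipped with the repository: compare_npz is a hypothetical helper, the name of the second file is an assumption, and no particular key names inside the archive are assumed.

```python
import numpy as np


def compare_npz(source_a, source_b):
    """Return {key: max absolute difference} for arrays present in both archives.

    Arrays are promoted to float64 before subtraction so that FP16 and FP32
    dumps can be compared directly. A shape mismatch is reported as infinity.
    """
    report = {}
    with np.load(source_a) as a, np.load(source_b) as b:
        for key in sorted(set(a.files) & set(b.files)):
            x = a[key].astype(np.float64)
            y = b[key].astype(np.float64)
            if x.shape != y.shape:
                report[key] = float("inf")
            elif x.size == 0:
                report[key] = 0.0
            else:
                report[key] = float(np.max(np.abs(x - y)))
    return report


if __name__ == "__main__":
    # "intermediates_other.npz" is a placeholder name for the second dump.
    diffs = compare_npz("intermediates.npz", "intermediates_other.npz")
    for key, err in sorted(diffs.items(), key=lambda kv: -kv[1]):
        print(f"{key}: {err:.3e}")
```

Sorting by error puts the most divergent tensors first, which usually points at the layer where two implementations start to disagree.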
Key configuration
Config file: model_config.json
| Parameter | Value |
|---|---|
| HIDDEN_SIZE | 1152 |
| INTERMEDIATE_SIZE | 4304 |
| NUM_HEADS | 16 |
| HEAD_DIM | 72 |
| NUM_LAYERS | 27 |
| PATCH_SIZE | 14 |
| IMAGE_SIZE | 384 |
| NUM_PATCHES | 729 |
| VOCAB_SIZE | 32000 |
| MAX_POSITION_EMBEDDINGS | 64 |
These values are set in model_config.json, which is read by siglip_model_text.py and siglip_model_vision.py.
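Two of the values in the table, HEAD_DIM and NUM_PATCHES, follow from the others. The sketch below reproduces that arithmetic with a plain dictionary; it does not read model_config.json, and the use of floor division for the patch grid is an assumption based on a non-overlapping, unpadded 14-pixel patch embedding.

```python
# Values copied from the configuration table above.
config = {
    "HIDDEN_SIZE": 1152,
    "NUM_HEADS": 16,
    "PATCH_SIZE": 14,
    "IMAGE_SIZE": 384,
}

# Each attention head gets an equal slice of the hidden dimension.
head_dim = config["HIDDEN_SIZE"] // config["NUM_HEADS"]

# 384 is not a multiple of 14, so only 27 full patches fit per side
# (27 * 14 = 378); the remaining 6 pixels are assumed to be dropped.
patches_per_side = config["IMAGE_SIZE"] // config["PATCH_SIZE"]
num_patches = patches_per_side ** 2

print(head_dim, patches_per_side, num_patches)  # 72 27 729
```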
Note:
export USE_TORCH=1 is mandatory. The reference code is based on torch_extend_ops.py, which requires PyTorch with CUDA.
Input / Output
Input
- Image: path to a local file (--image_path), a URL (--image_url), or a synthetic random image via seed (--image_seed)
- Text: query string (--text), padded / truncated to 64 tokens
- Dtype: --dtype fp16 (default) or --dtype fp32
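The padding and truncation of the text query to 64 tokens can be illustrated as follows. This is a sketch only: pad_or_truncate is a hypothetical helper, the padding token id used by the reference tokenizer is not stated here and must be supplied by the caller, and right-side padding is an assumption.

```python
def pad_or_truncate(token_ids, pad_id, max_len=64):
    """Return a list of exactly max_len token ids.

    Longer inputs are cut on the right; shorter inputs are filled on the
    right with pad_id (assumed padding side, see note above).
    """
    ids = list(token_ids)[:max_len]
    return ids + [pad_id] * (max_len - len(ids))


if __name__ == "__main__":
    # pad_id=0 is a placeholder value, not the tokenizer's real padding id.
    print(len(pad_or_truncate([5, 6, 7], pad_id=0)))       # 64
    print(len(pad_or_truncate(range(100), pad_id=0)))      # 64
```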
Output
Loading weights from siglip_so400m_fp16.npz …
SigLIP probability: tensor([0.9142])
The output is a sigmoid probability in [0, 1]; higher means the image and
text are more similar.
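For orientation, the sketch below shows how such a probability is obtained in the published SigLIP formulation: both embeddings are L2-normalized, their dot product is multiplied by exp(logit_scale), logit_bias is added, and a sigmoid is applied. It is written in NumPy for illustration and is not the code in scripts/generate.py; in particular, storing logit_scale in log space is an assumption about the checkpoint.

```python
import numpy as np


def siglip_probability(image_emb, text_emb, logit_scale, logit_bias):
    """Pairwise sigmoid probabilities, shape (num_texts, num_images).

    image_emb: (num_images, dim), text_emb: (num_texts, dim).
    logit_scale is assumed to be stored in log space, hence the exp().
    """
    image_emb = np.asarray(image_emb, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)
    img = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = np.exp(logit_scale) * (txt @ img.T) + logit_bias
    return 1.0 / (1.0 + np.exp(-logits))


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, not real model outputs.
    image = np.array([[1.0, 0.0]])
    texts = np.array([[2.0, 0.0], [0.0, 3.0]])
    print(siglip_probability(image, texts, logit_scale=0.0, logit_bias=0.0))
```

Because each image-text pair is scored independently by a sigmoid, the probabilities for several texts do not need to sum to 1, unlike the softmax used in CLIP-style models.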