Foundry-VLM-1.3B-165M

A 1.3B parameter vision-language model trained on 165M image-caption samples, part of the VLA Foundry collection.

Model Description

  • Architecture: ViT encoder (12 layers, 768 hidden dim, patch size 14, pixel-shuffle 2x) + Transformer decoder (24 layers, 2048 hidden dim, 16 heads)
  • Parameters: 1.3B (non-embedding)
  • Processor: SmolVLM2
  • Training data: 165M image-caption pairs from DataComp-DR-1B
  • LR schedule: Warmup + constant (no decay)
  • LLM backbone: Initialized from Foundry-LLM-1.2B-800B

This is an earlier checkpoint of the Foundry VLM; it serves as the vision-language backbone for the Foundry-VLA-1.7B action models.
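
For reference, the architecture hyperparameters above can be collected into a small configuration sketch. The field names below are illustrative and are not the actual vla_foundry config schema:

from dataclasses import dataclass

@dataclass
class VisionEncoderConfig:
    # ViT encoder settings from the model description above
    num_layers: int = 12
    hidden_dim: int = 768
    patch_size: int = 14
    pixel_shuffle: int = 2  # 2x pixel shuffle reduces the visual token count

@dataclass
class DecoderConfig:
    # Transformer decoder; weights initialized from Foundry-LLM-1.2B-800B
    num_layers: int = 24
    hidden_dim: int = 2048
    num_heads: int = 16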

Evaluation Results

COCO-val captioning:

BLEU-1   BLEU-2   BLEU-3   BLEU-4   ROUGE-L   CIDEr
57.25    37.12    23.23    14.44    37.13     50.17
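
The card does not include the evaluation script, but caption metrics like these are commonly computed with the pycocoevalcap package. The sketch below is illustrative only; it assumes captions have already been tokenized (e.g. with PTBTokenizer), and refs/hyps are toy placeholders rather than the actual COCO-val data:

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# image_id -> list of reference captions, and image_id -> single-element list with the model caption
refs = {0: ["a dog runs across a grassy field", "a dog is running on the grass"]}
hyps = {0: ["a dog running on grass"]}

bleu, _ = Bleu(4).compute_score(refs, hyps)     # [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
rouge_l, _ = Rouge().compute_score(refs, hyps)  # ROUGE-L
cider, _ = Cider().compute_score(refs, hyps)    # CIDEr (only meaningful over a full corpus)
print(bleu, rouge_l, cider)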

Usage

# Install from source
git clone https://github.com/TRI-ML/vla_foundry.git
cd vla_foundry
pip install -e .

# Load the pretrained checkpoint
from vla_foundry.models.base_model import BaseModel

model = BaseModel.from_pretrained("TRI-ML/Foundry-VLM-1.3B-165M")
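
The card lists SmolVLM2 as the processor, so an inference call might look roughly like the sketch below. The generate() call, the AutoProcessor loading path, and the prompt format are assumptions rather than the documented BaseModel API; consult the vla_foundry repository for the actual interface:

from PIL import Image
from transformers import AutoProcessor
from vla_foundry.models.base_model import BaseModel

# Assumption: the SmolVLM2 processor ships with the checkpoint and loads via AutoProcessor.
processor = AutoProcessor.from_pretrained("TRI-ML/Foundry-VLM-1.3B-165M")
model = BaseModel.from_pretrained("TRI-ML/Foundry-VLM-1.3B-165M")

image = Image.open("example.jpg")
inputs = processor(images=image, text="Describe the image.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)  # assumed HF-style generate()
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])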
