# SmolVLM Resolution Gate
A lightweight, efficient classifier for predicting whether sufficient visual information is available at different image resolutions to answer questions about document images.
## Model Details

### Model Architecture
- Base Model: HuggingFaceTB/SmolVLM-256M-Instruct
- Approach: External MLP classification head on frozen SmolVLM features
- Classification Task: Multi-class (3 resolution levels)
- Trainable Parameters: ~64K (0.025% of total)
- Total Parameters: 256M
### Key Features

- ✨ **Lightweight**: Only 256M total parameters for efficient inference
- ⚡ **Fast**: Frozen base model + small classification head
- 🎯 **Accurate**: 3-class resolution prediction (low/medium/high)
- 📦 **Portable**: Suitable for on-device and edge deployment
- 🔧 **Efficient Training**: Minimal parameter updates via frozen features
## Model Card

### Intended Use
This model predicts whether sufficient visual information is present at different resolutions to accurately answer questions about document images. It's designed to optimize computational cost by identifying when lower resolutions are sufficient.
**Primary Use Cases:**
- Document understanding systems needing resolution optimization
- Real-time vision-language model inference
- Edge device deployment
- Multi-resolution processing pipelines
### Supported Resolution Classes
- Class 0 (Low): Low resolution (384×384) is sufficient
- Class 1 (Medium): Medium resolution (512×512) is recommended
- Class 2 (High): High resolution (768×768+) is required
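The class indices above map directly to concrete target sizes. A minimal sketch of that mapping (the sizes come from the class table above; the `target_size` helper name is hypothetical, not part of the released model API):

```python
# Map gate output classes to target resolutions (values from the class table above).
RESOLUTION_BY_CLASS = {
    0: (384, 384),  # low resolution is sufficient
    1: (512, 512),  # medium resolution is recommended
    2: (768, 768),  # high resolution (or larger) is required
}

def target_size(resolution_class: int) -> tuple[int, int]:
    """Return the (width, height) to resize an image to for a predicted class."""
    if resolution_class not in RESOLUTION_BY_CLASS:
        raise ValueError(f"unknown resolution class: {resolution_class}")
    return RESOLUTION_BY_CLASS[resolution_class]
```

Downstream, the predicted class can feed a `PIL.Image.resize(target_size(c))` call before the expensive VLM forward pass.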
## Training Details

### Dataset
- Name: `hardness_data_mix`
- Samples: 81,924 document image-question pairs
- Split: 90% train / 10% validation
- Labels: Stratified by resolution requirement class
- Domains: TextVQA, DocVQA, ChartQA, InfographicVQA, HME100K
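The 90/10 stratified split described above can be sketched in plain Python; this is an illustrative per-class split, not the exact script used to build `hardness_data_mix`:

```python
import random

def stratified_split(labels, val_frac=0.1, seed=42):
    """Split sample indices into train/val while preserving class proportions."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    rng = random.Random(seed)
    train, val = [], []
    for label, indices in by_class.items():
        rng.shuffle(indices)
        n_val = max(1, int(len(indices) * val_frac))  # at least 1 val sample per class
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(val)
```

Stratifying per class (rather than splitting globally) keeps the rarer resolution classes represented in the validation set.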
### Training Configuration

- Batch Size: 64
- Learning Rate: 1e-4
- Optimizer: AdamW
- Epochs: Variable (6+ recommended)
- Warmup Steps: Automatic
- Loss: Cross-entropy with class weighting
- Hardware: NVIDIA H100 (80GB)
- Framework: PyTorch + Transformers + Accelerate
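The class-weighted cross-entropy above typically uses inverse class frequency for the weights. The exact weighting scheme is not specified here, so the inverse-frequency convention below is an assumption:

```python
from collections import Counter

def inverse_frequency_weights(labels, num_classes=3):
    """Per-class weights proportional to 1/frequency, scaled so the
    average weight over the dataset is 1.0 (an assumed convention)."""
    counts = Counter(labels)
    total = len(labels)
    return [total / (num_classes * counts[c]) for c in range(num_classes)]
```

The resulting list can be passed as `weight=torch.tensor(w)` to `torch.nn.CrossEntropyLoss` so that under-represented resolution classes contribute more to the loss.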
### Hyperparameters

```text
--model_name HuggingFaceTB/SmolVLM-256M-Instruct
--bsz 64
--lr 1e-4
--epochs 6-10
--val_frac 0.1
--seed 42  # for reproducibility
```
## Performance Metrics
Evaluated on stratified validation set (8,192 samples):
- Accuracy: ~80-85%
- Precision (macro): 0.78-0.82
- Recall (macro): 0.78-0.82
- ROC-AUC: 0.85-0.90
- Inference Speed: ~3-5 ms per sample (GPU), ~50-100 ms (CPU)
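The latency figures above translate into rough throughput bounds, which is often the more useful number when sizing a pipeline. A trivial sanity-check calculation:

```python
def throughput_per_second(latency_ms: float) -> float:
    """Samples per second for a given per-sample latency in milliseconds."""
    return 1000.0 / latency_ms

# From the ranges above: ~3-5 ms/sample on GPU, ~50-100 ms/sample on CPU.
gpu_range = (throughput_per_second(5.0), throughput_per_second(3.0))   # ~200-333/s
cpu_range = (throughput_per_second(100.0), throughput_per_second(50.0))  # ~10-20/s
```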
## Usage

### Installation

```bash
pip install transformers torch accelerate
```
### Load Model

```python
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("Kimhi/smolvlm-res-gate")
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")
```
### Inference

```python
import torch
from PIL import Image

# Load image and prepare input
image = Image.open("document.jpg")
question = "What is the main topic of this document?"
inputs = processor(images=image, text=question, return_tensors="pt")

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)

# Resolution prediction (0=low, 1=medium, 2=high)
resolution_class = outputs.logits.argmax(dim=-1).item()
confidence = outputs.logits.softmax(dim=-1).max().item()

print(f"Predicted resolution class: {resolution_class}")
print(f"Confidence: {confidence:.2%}")
```
### Batch Inference

```python
import torch
from PIL import Image

# Process multiple images at once
batch_images = [Image.open(f) for f in image_paths]
batch_questions = ["Question 1", "Question 2", ...]

inputs = processor(
    images=batch_images,
    text=batch_questions,
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    outputs = model(**inputs)

# One predicted resolution class per sample in the batch
predictions = outputs.logits.argmax(dim=-1)
```
## Limitations
- Trained primarily on document-centric datasets
- Performance may vary on out-of-distribution image types
- Assumes good image quality (works best with scanned documents)
- Class imbalance in training data may affect prediction confidence
## Alternative Approach
For an autoregressive alternative using end-to-end fine-tuning: 👉 Granite-Docling Resolution Gate (LoRA)
| Aspect | SmolVLM | Granite-Docling |
|---|---|---|
| Model Size | 256M | 258M |
| Approach | Frozen + classifier | SFT with LoRA |
| Trainable Params | 64K | 1.4M |
| Inference Type | Classification | Autoregressive |
| Inference Speed | Fast ⚡ | Medium ⚡⚡ |
| Output | Confidence scores | Direct text |
| Deployment | On-device optimized | Production-ready |
## Citation

If you use this model, please cite:

```bibtex
@misc{kimhi2025carescontextawareresolutionselector,
  title={CARES: Context-Aware Resolution Selector for VLMs},
  author={Moshe Kimhi and Nimrod Shabtay and Raja Giryes and Chaim Baskin and Eli Schwartz},
  year={2025},
  eprint={2510.19496},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
}
```
## License
Apache 2.0 - See LICENSE file for details
## Acknowledgements
Built on top of HuggingFaceTB's SmolVLM and trained using the Hugging Face Transformers library.
## Model Sources
- Base Model: SmolVLM-256M-Instruct
- Training Framework: Transformers
- Project: CARES GitHub