# SmolVLM Resolution Gate
A lightweight, efficient classifier for predicting whether sufficient visual information is available at different image resolutions to answer questions about document images.
## Model Details

### Model Architecture
- Base Model: HuggingFaceTB/SmolVLM-256M-Instruct
- Approach: External MLP classification head on frozen SmolVLM features
- Classification Task: Multi-class (3 resolution levels)
- Trainable Parameters: ~64K (0.025% of total)
- Total Parameters: 256M
### Key Features

- ✨ **Lightweight**: Only 256M total parameters for efficient inference
- ⚡ **Fast**: Frozen base model + small classification head
- 🎯 **Accurate**: 3-class resolution prediction (low/medium/high)
- 📦 **Portable**: Suitable for on-device and edge deployment
- 🔧 **Efficient Training**: Minimal parameter updates via frozen features
## Model Card

### Intended Use
This model predicts whether sufficient visual information is present at different resolutions to accurately answer questions about document images. It's designed to optimize computational cost by identifying when lower resolutions are sufficient.
**Primary Use Cases:**
- Document understanding systems needing resolution optimization
- Real-time vision-language model inference
- Edge device deployment
- Multi-resolution processing pipelines
### Supported Resolution Classes
- Class 0 (Low): Low resolution (384×384) is sufficient
- Class 1 (Medium): Medium resolution (512×512) is recommended
- Class 2 (High): High resolution (768×768+) is required
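The class indices above map directly to concrete target sizes. A minimal sketch of that mapping (the sizes come from the class table above; the `target_size` helper name is hypothetical, not part of the released model API):

```python
# Map gate output classes to target resolutions (values from the class table above).
RESOLUTION_BY_CLASS = {
    0: (384, 384),  # low resolution is sufficient
    1: (512, 512),  # medium resolution is recommended
    2: (768, 768),  # high resolution (or larger) is required
}

def target_size(resolution_class: int) -> tuple[int, int]:
    """Return the (width, height) to resize an image to for a predicted class."""
    if resolution_class not in RESOLUTION_BY_CLASS:
        raise ValueError(f"unknown resolution class: {resolution_class}")
    return RESOLUTION_BY_CLASS[resolution_class]
```

Downstream, the predicted class can feed a `PIL.Image.resize(target_size(c))` call before the expensive VLM forward pass.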
## Training Details

### Dataset
- Name: `hardness_data_mix`
- Samples: 81,924 document image-question pairs
- Split: 90% train / 10% validation
- Labels: Stratified by resolution requirement class
- Domains: TextVQA, DocVQA, ChartQA, InfographicVQA, HME100K
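The 90/10 stratified split described above can be sketched in plain Python; this is an illustrative per-class split, not the exact script used to build `hardness_data_mix`:

```python
import random

def stratified_split(labels, val_frac=0.1, seed=42):
    """Split sample indices into train/val while preserving class proportions."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    rng = random.Random(seed)
    train, val = [], []
    for label, indices in by_class.items():
        rng.shuffle(indices)
        n_val = max(1, int(len(indices) * val_frac))  # at least 1 val sample per class
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(val)
```

Stratifying per class (rather than splitting globally) keeps the rarer resolution classes represented in the validation set.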
### Training Configuration

- Batch Size: 64
- Learning Rate: 1e-4
- Optimizer: AdamW
- Epochs: Variable (6+ recommended)
- Warmup Steps: Automatic
- Loss: Cross-entropy with class weighting
- Hardware: NVIDIA H100 (80GB)
- Framework: PyTorch + Transformers + Accelerate
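The class-weighted cross-entropy above typically uses inverse class frequency for the weights. The exact weighting scheme is not specified here, so the inverse-frequency convention below is an assumption:

```python
from collections import Counter

def inverse_frequency_weights(labels, num_classes=3):
    """Per-class weights proportional to 1/frequency, scaled so the
    average weight over the dataset is 1.0 (an assumed convention)."""
    counts = Counter(labels)
    total = len(labels)
    return [total / (num_classes * counts[c]) for c in range(num_classes)]
```

The resulting list can be passed as `weight=torch.tensor(w)` to `torch.nn.CrossEntropyLoss` so that under-represented resolution classes contribute more to the loss.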
### Hyperparameters

```text
--model_name HuggingFaceTB/SmolVLM-256M-Instruct
--bsz 64
--lr 1e-4
--epochs 6-10
--val_frac 0.1
--seed 42  # for reproducibility
```
## Performance Metrics
Evaluated on stratified validation set (8,192 samples):
- Accuracy: ~80-85%
- Precision (macro): 0.78-0.82
- Recall (macro): 0.78-0.82
- ROC-AUC: 0.85-0.90
- Inference Speed: ~3-5 ms per sample (GPU), ~50-100 ms (CPU)
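The latency figures above translate into rough throughput bounds, which is often the more useful number when sizing a pipeline. A trivial sanity-check calculation:

```python
def throughput_per_second(latency_ms: float) -> float:
    """Samples per second for a given per-sample latency in milliseconds."""
    return 1000.0 / latency_ms

# From the ranges above: ~3-5 ms/sample on GPU, ~50-100 ms/sample on CPU.
gpu_range = (throughput_per_second(5.0), throughput_per_second(3.0))   # ~200-333/s
cpu_range = (throughput_per_second(100.0), throughput_per_second(50.0))  # ~10-20/s
```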
## Usage

### Installation

```bash
pip install transformers torch accelerate
```
### Load Model

```python
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("Kimhi/smolvlm-res-gate")
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")
```
### Inference

```python
import torch
from PIL import Image

# Load image and prepare input
image = Image.open("document.jpg")
question = "What is the main topic of this document?"
inputs = processor(images=image, text=question, return_tensors="pt")

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)

# Resolution prediction (0=low, 1=medium, 2=high)
resolution_class = outputs.logits.argmax(dim=-1).item()
confidence = outputs.logits.softmax(dim=-1).max().item()

print(f"Predicted resolution class: {resolution_class}")
print(f"Confidence: {confidence:.2%}")
```
### Batch Inference

```python
import torch
from PIL import Image

# Process multiple images at once
batch_images = [Image.open(f) for f in image_paths]
batch_questions = ["Question 1", "Question 2", ...]

inputs = processor(
    images=batch_images,
    text=batch_questions,
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    outputs = model(**inputs)

# One predicted resolution class per sample in the batch
predictions = outputs.logits.argmax(dim=-1)
```
## Limitations
- Trained primarily on document-centric datasets
- Performance may vary on out-of-distribution image types
- Assumes good image quality (works best with scanned documents)
- Class imbalance in training data may affect prediction confidence
## Alternative Approach
For an autoregressive alternative using end-to-end fine-tuning: 👉 Granite-Docling Resolution Gate (LoRA)
| Aspect | SmolVLM | Granite-Docling |
|---|---|---|
| Model Size | 256M | 258M |
| Approach | Frozen + classifier | SFT with LoRA |
| Trainable Params | 64K | 1.4M |
| Inference Type | Classification | Autoregressive |
| Inference Speed | Fast ⚡ | Medium ⚡⚡ |
| Output | Confidence scores | Direct text |
| Deployment | On-device optimized | Production-ready |
## Citation

If you use this model, please cite:

```bibtex
@misc{kimhi2025carescontextawareresolutionselector,
  title={CARES: Context-Aware Resolution Selector for VLMs},
  author={Moshe Kimhi and Nimrod Shabtay and Raja Giryes and Chaim Baskin and Eli Schwartz},
  year={2025},
  eprint={2510.19496},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
}
```
## License
Apache 2.0 - See LICENSE file for details
## Acknowledgements
Built on top of HuggingFaceTB's SmolVLM and trained using the Hugging Face Transformers library.
## Model Sources
- Base Model: SmolVLM-256M-Instruct
- Training Framework: Transformers
- Project: CARES GitHub