SmolVLM Resolution Gate

A lightweight, efficient classifier for predicting whether sufficient visual information is available at different image resolutions to answer questions about document images.

Model Details

Model Architecture

  • Base Model: HuggingFaceTB/SmolVLM-256M-Instruct
  • Approach: External MLP classification head on frozen SmolVLM features
  • Classification Task: Multi-class (3 resolution levels)
  • Trainable Parameters: ~64K (0.025% of total)
  • Total Parameters: 256M
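
Concretely, the trainable head can be sketched as a small two-layer MLP on pooled SmolVLM features. The dimensions below are assumptions (576 for the feature width, 110 for the hidden layer), chosen so the parameter count lands near the ~64K figure; the released head may differ.

```python
import torch
import torch.nn as nn

class ResolutionGate(nn.Module):
    """Hypothetical MLP head on frozen SmolVLM features (a sketch, not the released code)."""

    def __init__(self, feat_dim: int = 576, hidden: int = 110, n_classes: int = 3):
        # feat_dim=576 assumes SmolVLM-256M's hidden size; hidden=110 is picked
        # so the trainable parameter count is roughly the ~64K quoted above.
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: pooled features from the frozen base model, shape (batch, feat_dim)
        return self.mlp(feats)
```

Only the head's parameters would receive gradients; the 256M-parameter base stays frozen.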

Key Features

  • ✨ Lightweight: Only 256M parameters for efficient inference
  • ⚡ Fast: Frozen base model + small classification head
  • 🎯 Accurate: 3-class resolution prediction (low/medium/high)
  • 📦 Portable: Perfect for on-device and edge deployment
  • 🔧 Efficient Training: Minimal parameter updates via frozen features

Model Card

Intended Use

This model predicts whether sufficient visual information is present at different resolutions to accurately answer questions about document images. It's designed to optimize computational cost by identifying when lower resolutions are sufficient.

Primary Use Cases:

  • Document understanding systems needing resolution optimization
  • Real-time vision-language model inference
  • Edge device deployment
  • Multi-resolution processing pipelines

Supported Resolution Classes

  • Class 0 (Low): Low resolution (384×384) is sufficient
  • Class 1 (Medium): Medium resolution (512×512) is recommended
  • Class 2 (High): High resolution (768×768+) is required
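
In a pipeline, the predicted class would be mapped to a target side length before resizing the image. `select_resolution` is a hypothetical helper; the class-to-resolution pairs come from the list above.

```python
# Hypothetical routing helper: map the gate's predicted class to the
# target resolution listed above before running the downstream VLM.
CLASS_TO_RESOLUTION = {0: 384, 1: 512, 2: 768}

def select_resolution(pred_class: int) -> int:
    """Return the square side length to resize the document image to."""
    return CLASS_TO_RESOLUTION[pred_class]
```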

Training Details

Dataset

  • Name: hardness_data_mix
  • Samples: 81,924 document image-question pairs
  • Split: 90% train / 10% validation
  • Labels: Stratified by resolution requirement class
  • Domains: TextVQA, DocVQA, ChartQA, InfographicVQA, HME100K
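
A stratified 90/10 split of this kind can be reproduced with scikit-learn; `labels` below is a toy stand-in for the per-sample resolution classes, not the actual dataset.

```python
from sklearn.model_selection import train_test_split

# Toy example: `labels` holds one resolution class (0/1/2) per sample.
labels = [0] * 30 + [1] * 40 + [2] * 30
indices = list(range(len(labels)))

# 90% train / 10% validation, stratified by class, seed 42 as in the card.
train_idx, val_idx = train_test_split(
    indices, test_size=0.1, stratify=labels, random_state=42
)
```

Stratification keeps the class proportions of the validation set matched to the full dataset.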

Training Configuration

Batch Size: 64
Learning Rate: 1e-4
Optimizer: AdamW
Epochs: Variable (6+ recommended)
Warmup Steps: Automatic
Loss: Cross-entropy with class weighting
Hardware: NVIDIA H100 (80GB)
Framework: PyTorch + Transformers + Accelerate
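
The class-weighted cross-entropy loss could be set up as follows. The per-class counts are illustrative assumptions (chosen only to sum to the 81,924 samples above), not published dataset statistics.

```python
import torch
import torch.nn as nn

# Illustrative per-class sample counts (assumed; they sum to 81,924).
counts = torch.tensor([30000.0, 30000.0, 21924.0])

# Inverse-frequency weighting: rarer classes get larger weights.
weights = counts.sum() / (len(counts) * counts)
criterion = nn.CrossEntropyLoss(weight=weights)

# Usage: logits of shape (batch, 3) from the head, integer labels in {0, 1, 2}.
loss = criterion(torch.randn(4, 3), torch.tensor([0, 1, 2, 2]))
```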

Hyperparameters

  • --model_name: HuggingFaceTB/SmolVLM-256M-Instruct
  • --bsz: 64
  • --lr: 1e-4
  • --epochs: 6-10
  • --val_frac: 0.1
  • --seed: 42 (for reproducibility)

Performance Metrics

Evaluated on stratified validation set (8,192 samples):

  • Accuracy: ~80-85%
  • Precision (macro): 0.78-0.82
  • Recall (macro): 0.78-0.82
  • ROC-AUC: 0.85-0.90
  • Inference Speed: ~3-5 ms per sample (GPU), ~50-100 ms (CPU)
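
Macro-averaged metrics of the kind reported above can be computed with scikit-learn; the labels and predictions below are toy values, not the actual validation outputs.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy ground-truth labels and predictions, purely to show the metric calls.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="macro")  # mean of per-class precision
rec = recall_score(y_true, y_pred, average="macro")      # mean of per-class recall
```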

Usage

Installation

pip install transformers torch accelerate

Load Model

from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("Kimhi/smolvlm-res-gate")
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")

Inference

from PIL import Image
import torch

# Load image and prepare input
image = Image.open("document.jpg")
question = "What is the main topic of this document?"

inputs = processor(images=image, text=question, return_tensors="pt")

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)

# Resolution prediction (0=low, 1=med, 2=high)
resolution_class = outputs.logits.argmax(dim=-1).item()
confidence = outputs.logits.softmax(dim=-1).max().item()

print(f"Predicted resolution class: {resolution_class}")
print(f"Confidence: {confidence:.2%}")

Batch Inference

from PIL import Image
import torch

# Process multiple images; image_paths is a list of document image files
image_paths = ["doc1.jpg", "doc2.jpg"]
batch_images = [Image.open(p) for p in image_paths]
batch_questions = ["Question 1", "Question 2"]  # one question per image

inputs = processor(
    images=batch_images, 
    text=batch_questions, 
    return_tensors="pt",
    padding=True
)

with torch.no_grad():
    outputs = model(**inputs)

predictions = outputs.logits.argmax(dim=-1)

Limitations

  • Trained primarily on document-centric datasets
  • Performance may vary on out-of-distribution image types
  • Assumes good image quality (works best with scanned documents)
  • Class imbalance in training data may affect prediction confidence

Alternative Approach

For an autoregressive alternative using end-to-end fine-tuning: 👉 Granite-Docling Resolution Gate (LoRA)

Aspect           | SmolVLM             | Granite-Docling
Model Size       | 256M                | 258M
Approach         | Frozen + classifier | SFT with LoRA
Trainable Params | 64K                 | 1.4M
Inference Type   | Classification      | Autoregressive
Inference Speed  | Fast ⚡             | Medium ⚡⚡
Output           | Confidence scores   | Direct text
Deployment       | On-device optimized | Production-ready

Citation

If you use this model, please cite:

@misc{kimhi2025carescontextawareresolutionselector,
      title={CARES: Context-Aware Resolution Selector for VLMs}, 
      author={Moshe Kimhi and Nimrod Shabtay and Raja Giryes and Chaim Baskin and Eli Schwartz},
      year={2025},
      eprint={2510.19496},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
}

License

Apache 2.0 - See LICENSE file for details

Acknowledgements

Built on top of HuggingFaceTB's SmolVLM and trained using the Hugging Face Transformers library.
