Part of the Router-Suggest collection: finetuned checkpoints of VLMs for multimodal auto-completion.
This model generates conversational responses conditioned on both textual and visual context. It is suited to image-grounded chat and multimodal auto-completion. Uses beyond this image-grounded conversational setting are out of scope.
Example usage with Hugging Face Transformers:
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

processor = AutoProcessor.from_pretrained("devichand/Qwen2-VL-ImgChat-2B")
model = AutoModelForVision2Seq.from_pretrained("devichand/Qwen2-VL-ImgChat-2B")

# Any PIL image works here; replace the path with your own file.
image = Image.open("your_image.jpg")

inputs = processor(
    images=image,
    text="Describe the image.",
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(outputs[0], skip_special_tokens=True))
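Note that for decoder-only models like Qwen2-VL, `generate` returns the prompt tokens followed by the newly generated tokens, so decoding `outputs[0]` directly echoes the prompt. A common fix is to slice off the first `inputs.input_ids.shape[1]` tokens before decoding. A minimal sketch of that trimming, using plain lists as illustrative stand-ins for the real token-id tensors:

```python
# Stand-ins for inputs.input_ids[0] and model.generate(**inputs)[0]:
# the generated sequence starts with an echo of the prompt ids.
prompt_ids = [101, 7, 8, 9]
generated = [101, 7, 8, 9, 42, 43, 2]

# Keep only the tokens produced after the prompt.
new_tokens = generated[len(prompt_ids):]
print(new_tokens)  # [42, 43, 2]
```

In the real pipeline the same idea is `processor.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)`.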