MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models Paper • 2603.25744 • Published 6 days ago • 12
MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models Paper • 2603.25744 • Published 6 days ago • 12
Unified Spatio-Temporal Token Scoring for Efficient Video VLMs Paper • 2603.18004 • Published 14 days ago • 12
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding Paper • 2601.10611 • Published Jan 15 • 32
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models Paper • 2410.10818 • Published Oct 14, 2024 • 16
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models Paper • 2410.10818 • Published Oct 14, 2024 • 16
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models Paper • 2410.10818 • Published Oct 14, 2024 • 16
CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples Paper • 2402.13254 • Published Feb 20, 2024
VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation Paper • 2407.10972 • Published Jul 15, 2024 • 1
Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos Paper • 2410.02763 • Published Oct 3, 2024 • 7
Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos Paper • 2410.02763 • Published Oct 3, 2024 • 7
VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation Paper • 2407.10972 • Published Jul 15, 2024 • 1
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy Paper • 2406.20095 • Published Jun 28, 2024 • 18
LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model Paper • 2405.02363 • Published May 3, 2024 • 1
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want Paper • 2403.20271 • Published Mar 29, 2024 • 3