Abstract
TimeLens establishes a robust baseline for video temporal grounding by improving benchmark quality, addressing noisy training data, and developing efficient algorithmic design principles for multimodal large language models.
This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to the legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on this data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in the TimeLens models, a family of MLLMs that achieve state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All code, data, and models will be released to facilitate future research.
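To make two of the algorithmic ingredients above concrete, the minimal sketch below illustrates what interleaved textual time encoding and a verifiable reward for RLVR could look like in practice. The timestamp marker format, the `interleave_timestamps` helper, and the temporal-IoU reward are illustrative assumptions for exposition; the abstract does not specify TimeLens's exact formulation.

```python
# Illustrative sketch only: the interleaving format and reward definition are
# assumptions, not the paper's actual implementation.

def interleave_timestamps(frames, timestamps):
    """Interleave textual timestamps with per-frame token blocks, so the model
    reads time as plain text rather than through special time tokens
    (hypothetical format)."""
    pieces = []
    for t, frame in zip(timestamps, frames):
        pieces.append(f"<{t:.1f}s>")  # assumed textual time marker
        pieces.append(frame)          # frame embedding / visual token block
    return pieces


def temporal_iou_reward(pred_span, gt_span):
    """Verifiable reward for RLVR: temporal IoU between predicted and
    ground-truth (start, end) spans, a common choice for VTG (assumed here)."""
    (ps, pe), (gs, ge) = pred_span, gt_span
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = max(pe, ge) - min(ps, gs)
    return inter / union if union > 0 else 0.0


# Example: predicted span [12.0, 20.0]s vs. ground truth [10.0, 18.0]s -> IoU = 0.6
print(temporal_iou_reward((12.0, 20.0), (10.0, 18.0)))
```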
Community
Project Page: https://timelens-arc-lab.github.io/
Code: https://github.com/TencentARC/TimeLens
Model & Data: https://huggingface.co/collections/TencentARC/timelens
TimeLens-Bench Leaderboard: https://timelens-arc-lab.github.io/#leaderboard
Similar papers recommended by the Semantic Scholar API:
- ViRectify: A Challenging Benchmark for Video Reasoning Correction with Multimodal Large Language Models (2025)
- ChronusOmni: Improving Time Awareness of Omni Large Language Models (2025)
- VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models (2025)
- VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding (2026)
- LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling (2025)
- R-AVST: Empowering Video-LLMs with Fine-Grained Spatio-Temporal Reasoning in Complex Audio-Visual Scenarios (2025)
- Vidi2: Large Multimodal Models for Video Understanding and Creation (2025)
Models citing this paper 2
Datasets citing this paper 2
Spaces citing this paper 0