VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models Paper • 2603.06148 • Published Mar 6 • 2
Do Composed Image Retrieval Benchmarks Require Multimodal Composition? Paper • 2605.14787 • Published May 15
Can I Have Your Order? Monte-Carlo Tree Search for Slot Filling Ordering in Diffusion Language Models Paper • 2602.12586 • Published Feb 13 • 2
SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks Paper • 2605.31433 • Published 27 days ago • 28
Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation Paper • 2606.12594 • Published 15 days ago • 17
Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs Paper • 2512.05648 • Published Dec 5, 2025
The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity? Paper • 2601.23045 • Published Jan 30
Learning GUI Grounding with Spatial Reasoning from Visual Feedback Paper • 2509.21552 • Published Sep 25, 2025 • 11
Theorem Prover as a Judge for Synthetic Data Generation Paper • 2502.13137 • Published Feb 18, 2025 • 1
PiCSAR: Probabilistic Confidence Selection And Ranking Paper • 2508.21787 • Published Aug 29, 2025 • 4
Self-Training Large Language Models for Tool-Use Without Demonstrations Paper • 2502.05867 • Published Feb 9, 2025
Parameter-Efficient Fine-Tuning of LLaMA for the Clinical Domain Paper • 2307.03042 • Published Jul 6, 2023
Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them Paper • 2507.10616 • Published Jul 13, 2025 • 1
What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations Paper • 2502.08279 • Published Feb 12, 2025 • 1