WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces Paper • 2606.09426 • Published 21 days ago • 104
PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination Paper • 2605.03571 • Published May 5 • 7
FlowPIE: Test-Time Scientific Idea Evolution with Flow-Guided Literature Exploration Paper • 2603.29557 • Published Mar 31 • 17