arxiv:2604.28139

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

Published on Apr 30
· Submitted by
Chenxin Li
on May 1
Abstract

Claw-Eval-Live is a dynamic benchmark for workflow agents that tracks evolving real-world demand and verifies task execution through detailed logging and structured assessment.

AI-generated summary

LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or to verify whether a task was actually executed. We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer (updated across releases from public workflow-demand signals) from a reproducible, time-stamped release snapshot. Each release is constructed from these signals (ClawHub Top-500 skills in the current release) and materialized as controlled tasks with fixed fixtures, services, workspaces, and graders. For grading, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts, using deterministic checks when evidence is sufficient and structured LLM judging only for semantic dimensions. The release contains 105 tasks spanning controlled business services and local workspace repair, and evaluates 13 frontier models under a shared public pass rule. Experiments reveal that reliable workflow automation remains far from solved: the leading model passes only 66.7% of tasks and no model reaches 70%. Failures are structured by task family and execution surface, with HR, management, and multi-system business workflows as persistent bottlenecks and local workspace repair comparatively easier but unsaturated. Leaderboard rank alone is insufficient because models with similar pass rates can diverge in overall completion, and task-level discrimination concentrates in a middle band of tasks. Claw-Eval-Live suggests that workflow-agent evaluation should be grounded twice: in fresh external demand and in verifiable agent action.

Community

Paper submitter

Claw-Eval-Live is a live benchmark for LLM workflow agents. Each release is constructed from public workflow-demand signals (ClawHub Top-500 skills) rather than frozen at release time, and materialized as 105 executable tasks with fixed fixtures, services, and task-specific graders. Tasks span controlled business services (CRM, HR, finance, email, helpdesk, calendar) and local workspace repair. Grading combines deterministic checks with structured LLM judging only for semantic dimensions.
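The two-stage grading described above (deterministic checks first, structured LLM judging only for semantic dimensions) might look roughly like the following sketch. The `Evidence` fields, the task schema, and the `llm_judge` hook are hypothetical illustrations, not the benchmark's actual harness.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """Post-run evidence collected for a task (hypothetical shape)."""
    trace: list = field(default_factory=list)       # execution trace events
    audit_log: list = field(default_factory=list)   # service audit entries
    workspace: dict = field(default_factory=dict)   # post-run workspace artifacts


def grade(task, evidence, llm_judge=None):
    """Run deterministic checks; fall back to an LLM judge only for
    semantic dimensions that deterministic evidence cannot cover."""
    results = {}
    for check in task["deterministic_checks"]:
        results[check["name"]] = bool(check["fn"](evidence))
    for dim in task.get("semantic_dimensions", []):
        if llm_judge is not None:
            results[dim] = bool(llm_judge(dim, evidence))
    # A shared pass rule: every graded dimension must succeed.
    return all(results.values()), results
```

A usage example with a single workspace check: a task passes only if every deterministic check and every judged semantic dimension succeeds, which keeps the LLM judge out of the loop whenever logged evidence is sufficient.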


We evaluate 13 frontier models under a unified protocol. Key findings:

  • The top model (Claude Opus 4.6) passes only 66.7% of tasks, and no model reaches 70%.
  • Local workspace repair is near-ceiling, but service-backed business workflows remain the real bottleneck: HR averages 6.8%, management tasks all fail, and multi-system coordination stays hard.
  • Models with similar pass rates diverge substantially in overall completion, so leaderboard rank alone is insufficient.
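The last point can be made concrete with a toy illustration (made-up numbers, not results from the paper): two models with identical binary pass rates can differ sharply once partial completion is averaged in.

```python
def pass_rate(outcomes):
    """Fraction of tasks fully passed under a binary pass rule."""
    return sum(1 for o in outcomes if o >= 1.0) / len(outcomes)


def mean_completion(outcomes):
    """Average partial-completion score, each in [0, 1]."""
    return sum(outcomes) / len(outcomes)


# Hypothetical per-task completion scores for two models.
model_a = [1.0, 1.0, 0.9, 0.8, 0.0]
model_b = [1.0, 1.0, 0.2, 0.1, 0.0]

# Identical pass rates (0.4), but mean completion diverges:
# model_a averages 0.74 while model_b averages 0.46.
assert pass_rate(model_a) == pass_rate(model_b)
```

This is why the leaderboard rank induced by the pass rule alone can hide meaningful differences in how far models get on the tasks they fail.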


Project page: https://claw-eval-live.github.io

