cua-bench: A Framework for Benchmarking, Training Data, and RL Environments for Computer-Use Agents
TL;DR: Current computer-use agents (Claude Computer-Use, OpenAI CUA, Gemini 2.5 Computer-Use) show 10x performance variance across minor UI changes. We're releasing cua-bench—a framework for generating diverse training data, verified trajectories, and RL environments to fix this.
The Problem: 10x Variance Across Minor UI Changes
Recent advances in foundation models have given rise to autonomous agents capable of directly interacting with desktop computing environments: clicking buttons, typing text, navigating applications, and completing complex multi-step workflows.
But there's a problem: these agents are wildly inconsistent.
An agent that completes tasks on a clean desktop fails when windows overlap. It works on default themes but breaks in high-contrast mode. Trained on Windows 11, it fails the exact same task on Windows XP.
As demonstrated by Ullrich et al. (OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability), task success rates can differ by over 10x depending on theme, font, or language settings.
The root cause: training data lacks visual diversity.
Current benchmarks like OSWorld and Windows Agent Arena rely on static VM snapshots with fixed configurations:
- Static VMs: Tasks baked into bench-server components, requiring up to 20 minutes to load
- Fixed application sets: VMs come preinstalled with predetermined applications
- Limited task definitions: Tasks defined via JSON with constrained vocabularies
- Slow iteration: Updating tasks requires rebuilding entire VM images
Cua-Bench addresses all of these limitations.
Introducing Cua-Bench
Cua-Bench is a flexible and scalable framework for constructing verifiable, dynamic computer-use environments. It supports:
| Capability | macOS | Linux | Windows | Android | iOS | VM | Webtop |
|---|---|---|---|---|---|---|---|
| UI & GUI Data Generation | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Trajectory Data Generation | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Agentic Benchmarks | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Shell Apps & Simulators | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Real Apps | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
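To make the table concrete, here is a minimal sketch of constructing environments for two of these targets. The cuabench module name and the Environment constructor arguments are assumptions for illustration, not the published API.

import cuabench as cb  # assumed module name

# A lightweight webtop environment (no VM, single CPU) for shell apps and simulators.
webtop_env = cb.Environment(target="webtop", apps=["spotify", "slack"])

# A full VM environment when real applications are required.
vm_env = cb.Environment(target="vm", os="windows-11")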
1. Scalable GUI Data Generation
Cua-Bench generates realistic and diverse GUI data at scale—customizable across multiple dimensions:
- Different programs and applications
- Window placements and screen coverage
- Different graphic styles, colors, and contrasts
- Different platforms and devices
- Different resolutions (640x480 to 3440x1440)
Examples of data generated by Cua-Bench across multiple OS themes and applications
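As an illustration of how these dimensions might be expressed in code, here is a hedged sketch of a variation configuration. The cb.generate_gui_data entry point and its parameter names are assumptions chosen to illustrate the idea, not the documented API.

import cuabench as cb  # assumed module name

# Hypothetical variation config: each field corresponds to one of the
# dimensions listed above (apps, layout, style, platform, resolution).
variation_config = {
    "apps": ["spotify", "slack", "whatsapp"],
    "window_layouts": ["tiled", "overlapping", "maximized"],
    "themes": ["light", "dark", "high-contrast"],
    "platforms": ["windows-11", "windows-98", "macos", "ubuntu"],
    "resolutions": [(640, 480), (1920, 1080), (3440, 1440)],
}

# Assumed entry point: sample screenshots + HTML snapshots per combination.
dataset = cb.generate_gui_data(variation_config, samples_per_combination=10)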
HTML Snapshots
Beyond raw screenshots, Cua-Bench captures full HTML snapshots of each window along with:
- Bounding box coordinates
- Accessibility labels
- CSS styles
This enables offline rendering and cross-OS replay of captured states.
GUI data with bounding boxes and accessibility labels around UI elements
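To make the captured state concrete, here is a minimal sketch of the kind of record a snapshot might contain. It uses plain dataclasses and illustrative field names, not the actual cua-bench schema.

from dataclasses import dataclass, field

@dataclass
class UIElement:
    bbox: tuple[int, int, int, int]      # (x, y, width, height) in pixels
    accessibility_label: str
    css: dict[str, str] = field(default_factory=dict)

@dataclass
class WindowSnapshot:
    app: str
    html: str                            # full HTML of the window
    screenshot_path: str                 # rendered pixels for this state
    elements: list[UIElement] = field(default_factory=list)

Because the HTML and styles are preserved alongside the pixels, the same state can be re-rendered offline or replayed under a different OS theme.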
Cross-Platform Variety
Cua-Bench generates data across desktop environments (Windows, macOS, Linux) and mobile environments (iOS, Android):
Different macOS interfaces with simulated desktop clutter
Android (top) and iOS (bottom) environments
Cross-Time Variety
Most GUI datasets focus exclusively on modern interfaces. This biases agents toward current visual styles, reducing robustness.
Cua-Bench generates data from old and new OS versions alike:
Top: Windows 98 | Bottom: Windows 10 — same tasks, different eras
Resolution Variety
From low-resolution (640x480) to high-resolution (3440x1440), Cua-Bench covers the full range:
High resolution (3440x1440) examples
Low resolution (640x480) examples
2. Agentic Trajectory Generation
Cua-Bench exposes a Playwright-like Python API for defining oracle solutions—programmatic reference implementations that complete tasks step-by-step.
Oracle Solutions
Each task can define a reference solution with the @cb.solve_task decorator, alongside setup and evaluation hooks:
@cb.tasks_config
def config():
    return {
        "scenarios": [
            {"playlist_name": "Workout Mix", "song": "Eye of the Tiger"},
            {"playlist_name": "Chill Vibes", "song": "Weightless"},
            # ... thousands of variations from a single template
        ]
    }

@cb.setup_task
async def setup(env, scenario):
    await env.spotify.open()
    await env.spotify.create_playlist(scenario["playlist_name"])

@cb.solve_task
async def solve(env, scenario):
    await env.spotify.search(scenario["song"])
    await env.spotify.add_to_playlist(scenario["playlist_name"])

@cb.evaluate_task
async def evaluate(env, scenario):
    playlist = await env.spotify.get_playlist(scenario["playlist_name"])
    return scenario["song"] in playlist.songs
When executed, Cua-Bench records each action alongside the environment state—capturing HTML snapshots, screenshots, and input events—to produce complete multi-step trajectories for behavioral cloning or supervised learning.
Multi-step long-horizon task trajectory collection
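As a rough sketch (field names are illustrative, not the on-disk format), each step of a recorded trajectory pairs an action with the state observed before it:

from dataclasses import dataclass, field

@dataclass
class TrajectoryStep:
    screenshot_path: str   # pixels observed before the action
    html_snapshot: str     # HTML of the focused window before the action
    action: dict           # e.g. {"type": "click", "x": 412, "y": 96}

@dataclass
class Trajectory:
    task_id: str
    scenario: dict                        # injected scenario parameters
    steps: list[TrajectoryStep] = field(default_factory=list)
    success: bool = False                 # result of the @cb.evaluate_task check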
Trajectory Replotting
Record 1 human demonstration → re-render across 10 OS themes = 10 training trajectories.
Same actions, different visual presentations. This is how you build robust cross-platform training data at scale.
Trajectory traces from a Linux VM environment with low-level actions and HTML snapshots
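A hedged sketch of what replotting might look like in code; cb.load_demonstration and cb.replot are hypothetical names used only to illustrate the pattern.

import cuabench as cb  # assumed module name

THEMES = ["windows-11-light", "windows-11-dark", "windows-98",
          "macos-light", "macos-dark", "ubuntu-high-contrast",
          "android", "ios", "kde-plasma", "xfce"]

# One recorded human demonstration (file name is illustrative).
demo = cb.load_demonstration("create_playlist_demo.json")

# Re-render the same action sequence under each theme: 1 demo -> 10 trajectories.
trajectories = [cb.replot(demo, theme=theme) for theme in THEMES]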
Task Development Workflow
Users define tasks via four decorators:
- @tasks_config — scenario variations
- @setup_task — environment initialization
- @evaluate_task — success verification
- @solve_task — oracle solution
JSON-based scenario injection enables thousands of task variations from a single template.
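For instance, the scenarios from the playlist example above could be loaded from a JSON file rather than hard-coded. The file name and loading pattern here are illustrative, not a prescribed layout.

import json
import cuabench as cb  # assumed module name

# scenarios.json (illustrative contents):
#   [
#     {"playlist_name": "Workout Mix", "song": "Eye of the Tiger"},
#     {"playlist_name": "Chill Vibes", "song": "Weightless"}
#   ]

@cb.tasks_config
def config():
    # Load scenario variations from JSON instead of hard-coding them.
    with open("scenarios.json") as f:
        return {"scenarios": json.load(f)}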
Development workflow for generating tasks, running evals, and collecting trajectory data
Overview of Cua-Bench's task and environment architecture
3. Simulators & Environments
Beyond data generation, Cua-Bench provides full-fledged simulators for RL training.
Benchmark Adapters
Cua-Bench's Python API allows existing benchmarks to be wrapped as adapters:
- OSWorld
- Windows Agent Arena
- MiniWoB++
These adapters delegate to original setup/evaluation code while gaining access to Cua-Bench's variation and recording capabilities.
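A hedged sketch of what an adapter might look like; the cb.BenchmarkAdapter base class and method names are assumptions chosen to illustrate the delegation pattern, not the actual interface.

import cuabench as cb  # assumed module name

class OSWorldAdapter(cb.BenchmarkAdapter):  # hypothetical base class
    """Wraps an OSWorld task so it runs inside a Cua-Bench environment."""

    def __init__(self, osworld_task):
        self.task = osworld_task

    async def setup(self, env, scenario):
        # Delegate to OSWorld's original setup logic.
        await self.task.setup(env)

    async def evaluate(self, env, scenario):
        # Delegate to OSWorld's original success check, while Cua-Bench
        # adds visual variation and trajectory recording around it.
        return await self.task.evaluate(env)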
Shell Applications
Cua-Bench provides simulated shell applications with realistic GUI elements:
- Spotify clone — playlists, search, playback
- Slack clone — channels, messages, threads
- WhatsApp clone — chats, contacts, media
Shell applications with full functionality for agent interaction and testing
Each application is fully configurable:
- Appearance (style, color, theme)
- Placement (window position, size)
- Content (data, state)
Flexible per-application configuration for customizing content, UI elements, and more
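As a sketch of per-application configuration covering the three axes above (the constructor and parameter names are assumptions for illustration):

import cuabench as cb  # assumed module name

spotify = cb.shell_apps.spotify(
    theme="dark",                                                # appearance
    window={"x": 100, "y": 80, "width": 1280, "height": 800},    # placement
    state={"playlists": {"Workout Mix": ["Eye of the Tiger"]}},  # content
)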
Agents can explore, interact, and learn in realistic environments—without spinning up VMs.
Why This Matters
| Problem | Cua-Bench Solution |
|---|---|
| Static VM snapshots | Lightweight webtop environments (single CPU, no virtualization) |
| Fixed application sets | Randomized installed apps, favorited apps, window layouts |
| Limited task definitions | Declarative Python API with JSON scenario injection |
| Slow iteration (rebuild VM images) | Tasks decoupled from VM images |
| Training data lacks visual diversity | Generate thousands of visual variations automatically |
| No ground-truth trajectories | Oracle solutions with full HTML + screenshot + input event capture |
Getting Started
Full technical report and documentation: cuabench.ai
If you're working on computer-use agents—whether for research or production—we'd love to hear from you.
Citation
@article{cuabench2025,
  title={Cua-Bench: Technical Report},
  author={Cua AI Team},
  year={2025},
  url={https://cuabench.ai}
}
Links
- Technical Report: cuabench.ai
- Twitter/X: @trycua
If you're a research lab or researcher working on computer-use agents, reach out via DM or sign up on the website for early access.
