cua-bench: A Framework for Benchmarking, Training Data, and RL Environments for Computer-Use Agents

Community Article · Published December 16, 2025


TL;DR: Current computer-use agents (Claude Computer-Use, OpenAI CUA, Gemini 2.5 Computer-Use) show 10x performance variance across minor UI changes. We're releasing cua-bench—a framework for generating diverse training data, verified trajectories, and RL environments to fix this.


The Problem: 10x Variance Across Minor UI Changes

Recent advances in foundation models have given rise to autonomous agents capable of directly interacting with desktop computing environments: clicking buttons, typing text, navigating applications, and completing complex multi-step workflows.

But there's a problem: these agents are wildly inconsistent.

An agent that completes tasks on a clean desktop fails when windows overlap. It works on the default theme but breaks in high-contrast mode. Trained on Windows 11, it fails the exact same task on Windows XP.

As demonstrated by Ullrich et al. (OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability), task success rates can differ by over 10x depending on theme, font, or language settings.

The root cause: training data lacks visual diversity.

Current benchmarks like OSWorld and Windows Agent Arena rely on static VM snapshots with fixed configurations:

  • Static VMs: Tasks baked into bench-server components, requiring up to 20 minutes to load
  • Fixed application sets: VMs come preinstalled with predetermined applications
  • Limited task definitions: Tasks defined via JSON with constrained vocabularies
  • Slow iteration: Updating tasks requires rebuilding entire VM images

Cua-Bench addresses all of these limitations.


Introducing Cua-Bench

Cua-Bench is a flexible and scalable framework for constructing verifiable, dynamic computer-use environments. It supports UI & GUI data generation, trajectory data generation, agentic benchmarks, shell apps & simulators, and real apps across macOS, Linux, Windows, Android, iOS, VM, and webtop targets.

1. Scalable GUI Data Generation

Cua-Bench generates realistic and diverse GUI data at scale, customizable across multiple dimensions (see the sketch after this list):

  • Different programs and applications
  • Window placements and screen coverage
  • Different graphic styles, colors, and contrasts
  • Different platforms and devices
  • Different resolutions (640x480 to 3440x1440)
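
Sweeping these dimensions multiplies quickly. Here is a minimal combinatorics sketch, assuming illustrative theme, platform, and resolution lists (these are not Cua-Bench's actual configuration API):

# A minimal combinatorics sketch; the lists below are illustrative values,
# not Cua-Bench's actual configuration API.
from itertools import product

themes = ["light", "dark", "high-contrast"]
platforms = ["windows-11", "windows-98", "macos", "ubuntu"]
resolutions = [(640, 480), (1920, 1080), (3440, 1440)]

variants = [
    {"theme": t, "platform": p, "resolution": r}
    for t, p, r in product(themes, platforms, resolutions)
]
print(len(variants))  # 3 themes x 4 platforms x 3 resolutions = 36 variants

Every scenario rendered under each variant yields a distinct training example with identical semantics.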

Examples of data generated by Cua-Bench across multiple OS themes and applications

HTML Snapshots

Beyond raw screenshots, Cua-Bench captures full HTML snapshots of each window along with:

  • Bounding box coordinates
  • Accessibility labels
  • CSS styles

This enables offline rendering and cross-OS replay of captured states.
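
To make the captured state concrete, a snapshot record might look like the sketch below. The dataclass and field names are assumptions for illustration, not Cua-Bench's actual format:

# Illustrative schema for a captured window state; field names are
# assumptions, not Cua-Bench's actual on-disk format.
from dataclasses import dataclass, field

@dataclass
class ElementAnnotation:
    bbox: tuple[int, int, int, int]  # (x, y, width, height) in window pixels
    accessibility_label: str         # e.g. "Create playlist button"
    css_selector: str                # points back into the HTML snapshot

@dataclass
class WindowSnapshot:
    html: str              # full DOM of the window
    css: str               # captured styles, enabling offline re-rendering
    screenshot_path: str   # rasterized view of the same state
    elements: list[ElementAnnotation] = field(default_factory=list)

Storing the DOM and styles alongside the screenshot is what makes cross-OS replay possible: the same state can be re-rendered later under a different theme or platform.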

GUI data with bounding boxes and accessibility labels around UI elements

Cross-Platform Variety

Cua-Bench generates data across desktop environments (Windows, macOS, Linux) and mobile environments (iOS, Android):

Different macOS interfaces with simulated desktop clutter

Android (top) and iOS (bottom) environments

Cross-Time Variety

Most GUI datasets focus exclusively on modern interfaces. This biases agents toward current visual styles, reducing robustness.

Cua-Bench generates data from old and new OS versions alike:

Top: Windows 98 | Bottom: Windows 10 — same tasks, different eras

Resolution Variety

From low resolution (640x480) to high resolution (3440x1440), Cua-Bench covers the full range:

High-resolution (3440x1440) examples

Low-resolution (640x480) examples


2. Agentic Trajectory Generation

Cua-Bench exposes a Playwright-like Python API for defining oracle solutions—programmatic reference implementations that complete tasks step-by-step.

Oracle Solutions

Each task can define a reference solution with the @cb.solve_task decorator, alongside configuration, setup, and evaluation hooks:

@cb.tasks_config
def config():
    # Scenario parameters injected into setup, solve, and evaluate below.
    return {
        "scenarios": [
            {"playlist_name": "Workout Mix", "song": "Eye of the Tiger"},
            {"playlist_name": "Chill Vibes", "song": "Weightless"},
            # ... thousands of variations from a single template
        ]
    }

@cb.setup_task
async def setup(env, scenario):
    # Bring the environment into the task's starting state.
    await env.spotify.open()
    await env.spotify.create_playlist(scenario["playlist_name"])

@cb.solve_task
async def solve(env, scenario):
    # Oracle solution: a programmatic reference that completes the task.
    await env.spotify.search(scenario["song"])
    await env.spotify.add_to_playlist(scenario["playlist_name"])

@cb.evaluate_task
async def evaluate(env, scenario):
    # Verify success against ground truth, independent of how it was solved.
    playlist = await env.spotify.get_playlist(scenario["playlist_name"])
    return scenario["song"] in playlist.songs

When executed, Cua-Bench records each action alongside the environment state—capturing HTML snapshots, screenshots, and input events—to produce complete multi-step trajectories for behavioral cloning or supervised learning.
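
Each recorded step then doubles as a supervised example: observation in, action out. The record shape below is an assumption for illustration:

# Illustrative shape of one recorded trajectory step; field names are
# assumptions, not Cua-Bench's actual recording format.
step = {
    "observation": {
        "screenshot": "traj_0042/step_003.png",      # what the agent sees
        "html_snapshot": "traj_0042/step_003.html",  # replayable DOM state
    },
    "action": {"type": "click", "x": 412, "y": 228},  # supervision target
}
# Behavioral cloning then fits: policy(observation) -> action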

Multi-step, long-horizon task trajectory collection

Trajectory Replotting

Record 1 human demonstration → re-render across 10 OS themes = 10 training trajectories.

Same actions, different visual presentations. This is how you build robust cross-platform training data at scale.
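
In pseudocode, replotting is a loop that re-renders each recorded state under a new theme while keeping the action sequence fixed. The render_state helper and theme names here are hypothetical stand-ins for Cua-Bench's recording and rendering machinery:

# Hypothetical sketch of trajectory replotting; render_state() and the
# theme names are stand-ins, not Cua-Bench's actual API.
THEMES = ["win11-light", "win11-dark", "win98", "macos-aqua", "gnome-adwaita"]

def replot(demo_steps, render_state, themes=THEMES):
    trajectories = []
    for theme in themes:
        steps = []
        for snapshot, action in demo_steps:
            # Re-render the captured HTML snapshot under the new theme,
            # then pair the themed screenshot with the original action.
            screenshot = render_state(snapshot, theme=theme)
            steps.append((screenshot, action))
        trajectories.append(steps)
    return trajectories  # one trajectory per theme, identical actions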

Trajectory traces from a Linux VM environment with low-level actions and HTML snapshots

Task Development Workflow

Users define tasks via four decorators:

  • @cb.tasks_config — scenario variations
  • @cb.setup_task — environment initialization
  • @cb.evaluate_task — success verification
  • @cb.solve_task — oracle solution

JSON-based scenario injection enables thousands of task variations from a single template.
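
For instance, injecting a small JSON list of scenarios into the template above might look like this; the inline JSON and loop are assumptions about the mechanism, not Cua-Bench's actual loader:

# Illustrative scenario injection; the inline JSON and loop are
# assumptions, not Cua-Bench's actual loader.
import json

scenarios = json.loads("""
[
  {"playlist_name": "Workout Mix", "song": "Eye of the Tiger"},
  {"playlist_name": "Chill Vibes", "song": "Weightless"},
  {"playlist_name": "Focus Flow", "song": "Clair de Lune"}
]
""")

# Each entry instantiates the same setup/solve/evaluate template once.
for scenario in scenarios:
    print(f"add {scenario['song']!r} to {scenario['playlist_name']!r}")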

Development workflow for generating tasks, running evals, and collecting trajectory data

Overview of Cua-Bench's task and environment architecture


3. Simulators & Environments

Beyond data generation, Cua-Bench provides full-fledged simulators for RL training.

Benchmark Adapters

Cua-Bench's Python API allows existing benchmarks to be wrapped as adapters:

  • OSWorld
  • Windows Agent Arena
  • MiniWoB++

These adapters delegate to original setup/evaluation code while gaining access to Cua-Bench's variation and recording capabilities.
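
Conceptually, an adapter reuses the wrapped benchmark's own entry points inside the decorator interface. The sketch below is hypothetical; the osworld module and its functions are stand-ins, not real APIs:

# Hypothetical adapter sketch; `osworld` and its functions are stand-ins
# for a wrapped benchmark's native Python entry points.
import osworld  # assumed package name, not the real import

@cb.tasks_config
def config():
    # Reuse the benchmark's existing task definitions as scenarios.
    return {"scenarios": osworld.load_task_configs()}

@cb.setup_task
async def setup(env, scenario):
    await osworld.setup(env.vm, scenario)  # delegate to the original setup

@cb.evaluate_task
async def evaluate(env, scenario):
    return await osworld.evaluate(env.vm, scenario)  # original verifier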

Shell Applications

Cua-Bench provides simulated shell applications with realistic GUI elements:

  • Spotify clone — playlists, search, playback
  • Slack clone — channels, messages, threads
  • WhatsApp clone — chats, contacts, media

Spotify and WhatsApp shell applications with full functionality for agent interaction and testing

Each application is fully configurable:

  • Appearance (style, color, theme)
  • Placement (window position, size)
  • Content (data, state)

Flexible per-application configuration for customizing content, UI elements, and more

Agents can explore, interact, and learn in realistic environments—without spinning up VMs.
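
Putting the three axes together, a per-app configuration might be expressed as a plain dictionary. The key names here are illustrative assumptions about the shape of the API:

# Illustrative per-app configuration; key names are assumptions, not
# Cua-Bench's actual shell-app API.
spotify_config = {
    "appearance": {"theme": "dark", "accent_color": "#1DB954"},
    "placement": {"x": 120, "y": 80, "width": 1280, "height": 800},
    "content": {
        "playlists": [
            {"name": "Workout Mix", "songs": ["Eye of the Tiger"]},
        ],
    },
}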


Why This Matters

Each problem with current benchmarks maps to a Cua-Bench solution:

  • Static VM snapshots → lightweight webtop environments (single CPU, no virtualization)
  • Fixed application sets → randomized installed apps, favorited apps, and window layouts
  • Limited task definitions → declarative Python API with JSON scenario injection
  • Slow iteration (rebuilding VM images) → tasks decoupled from VM images
  • Training data lacking visual diversity → thousands of visual variations generated automatically
  • No ground-truth trajectories → oracle solutions with full HTML, screenshot, and input-event capture

Getting Started

Full technical report and documentation: cuabench.ai

If you're working on computer-use agents, whether for research or production, we'd love to hear from you. Reach out via DM or sign up on the website for early access.


Citation

@article{cuabench2025,
  title={Cua-Bench: Technical Report},
  author={Cua AI Team},
  year={2025},
  url={https://cuabench.ai}
}
