cua-bench: A Framework for Benchmarking, Training Data, and RL Environments for Computer-Use Agents
TL;DR: Current computer-use agents (Claude Computer-Use, OpenAI CUA, Gemini 2.5 Computer-Use) show 10x performance variance across minor UI changes. We're releasing cua-bench—a framework for generating diverse training data, verified trajectories, and RL environments to fix this.
The Problem: 10x Variance Across Minor UI Changes
Recent advances in foundation models have given rise to autonomous agents capable of directly interacting with desktop computing environments: clicking buttons, typing text, navigating applications, and completing complex multi-step workflows.
But there's a problem: these agents are wildly inconsistent.
An agent that completes tasks on a clean desktop fails when windows overlap. It works on default themes but breaks in high-contrast mode. Trained on Windows 11, it fails the exact same task on Windows XP.
As demonstrated by Ullrich et al. (OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability), task success rates can differ by over 10x depending on theme, font, or language settings.
The root cause: training data lacks visual diversity.
Current benchmarks like OSWorld and Windows Agent Arena rely on static VM snapshots with fixed configurations:
- Static VMs: Tasks baked into bench-server components, requiring up to 20 minutes to load
- Fixed application sets: VMs come preinstalled with predetermined applications
- Limited task definitions: Tasks defined via JSON with constrained vocabularies
- Slow iteration: Updating tasks requires rebuilding entire VM images
Cua-Bench addresses all of these limitations.
Introducing Cua-Bench
Cua-Bench is a flexible and scalable framework for constructing verifiable, dynamic computer-use environments. It supports:
| Capability | macOS | Linux | Windows | Android | iOS | VM | Webtop |
|---|---|---|---|---|---|---|---|
| UI & GUI Data Generation | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Trajectory Data Generation | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Agentic Benchmarks | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Shell Apps & Simulators | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Real Apps | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
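To make the table concrete, here is a minimal sketch of constructing environments for two of these targets. The cuabench module name and the Environment constructor arguments are assumptions for illustration, not the published API.

import cuabench as cb  # assumed module name

# A lightweight webtop environment (no VM, single CPU) for shell apps and simulators.
webtop_env = cb.Environment(target="webtop", apps=["spotify", "slack"])

# A full VM environment when real applications are required.
vm_env = cb.Environment(target="vm", os="windows-11")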
1. Scalable GUI Data Generation
Cua-Bench generates realistic and diverse GUI data at scale—customizable across multiple dimensions:
- Different programs and applications
- Window placements and screen coverage
- Different graphic styles, colors, and contrasts
- Different platforms and devices
- Different resolutions (640x480 to 3440x1440)
Examples of data generated by Cua-Bench across multiple OS themes and applications
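As an illustration of how these dimensions might be expressed in code, here is a hedged sketch of a variation configuration. The cb.generate_gui_data entry point and its parameter names are assumptions chosen to illustrate the idea, not the documented API.

import cuabench as cb  # assumed module name

# Hypothetical variation config: each field corresponds to one of the
# dimensions listed above (apps, layout, style, platform, resolution).
variation_config = {
    "apps": ["spotify", "slack", "whatsapp"],
    "window_layouts": ["tiled", "overlapping", "maximized"],
    "themes": ["light", "dark", "high-contrast"],
    "platforms": ["windows-11", "windows-98", "macos", "ubuntu"],
    "resolutions": [(640, 480), (1920, 1080), (3440, 1440)],
}

# Assumed entry point: sample screenshots + HTML snapshots per combination.
dataset = cb.generate_gui_data(variation_config, samples_per_combination=10)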
HTML Snapshots
Beyond raw screenshots, Cua-Bench captures full HTML snapshots of each window along with:
- Bounding box coordinates
- Accessibility labels
- CSS styles
This enables offline rendering and cross-OS replay of captured states.
GUI data with bounding boxes and accessibility labels around UI elements
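To make the captured state concrete, here is a minimal sketch of the kind of record a snapshot might contain. It uses plain dataclasses and illustrative field names, not the actual cua-bench schema.

from dataclasses import dataclass, field

@dataclass
class UIElement:
    bbox: tuple[int, int, int, int]      # (x, y, width, height) in pixels
    accessibility_label: str
    css: dict[str, str] = field(default_factory=dict)

@dataclass
class WindowSnapshot:
    app: str
    html: str                            # full HTML of the window
    screenshot_path: str                 # rendered pixels for this state
    elements: list[UIElement] = field(default_factory=list)

Because the HTML and styles are preserved alongside the pixels, the same state can be re-rendered offline or replayed under a different OS theme.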
Cross-Platform Variety
Cua-Bench generates data across desktop environments (Windows, macOS, Linux) and mobile environments (iOS, Android):
Different macOS interfaces with simulated desktop clutter
Android (top) and iOS (bottom) environments
Cross-Time Variety
Most GUI datasets focus exclusively on modern interfaces. This biases agents toward current visual styles, reducing robustness.
Cua-Bench generates data from old and new OS versions alike:
Top: Windows 98 | Bottom: Windows 10 — same tasks, different eras
Resolution Variety
From low-resolution (640x480) to high-resolution (3440x1440), Cua-Bench covers the full range:
High resolution (3440x1440) examples
Low resolution (640x480) examples
2. Agentic Trajectory Generation
Cua-Bench exposes a Playwright-like Python API for defining oracle solutions—programmatic reference implementations that complete tasks step-by-step.
Oracle Solutions
Each task can define a reference solution with the @cb.solve_task decorator, alongside setup and evaluation hooks:
@cb.tasks_config
def config():
    return {
        "scenarios": [
            {"playlist_name": "Workout Mix", "song": "Eye of the Tiger"},
            {"playlist_name": "Chill Vibes", "song": "Weightless"},
            # ... thousands of variations from a single template
        ]
    }

@cb.setup_task
async def setup(env, scenario):
    await env.spotify.open()
    await env.spotify.create_playlist(scenario["playlist_name"])

@cb.solve_task
async def solve(env, scenario):
    await env.spotify.search(scenario["song"])
    await env.spotify.add_to_playlist(scenario["playlist_name"])

@cb.evaluate_task
async def evaluate(env, scenario):
    playlist = await env.spotify.get_playlist(scenario["playlist_name"])
    return scenario["song"] in playlist.songs
When executed, Cua-Bench records each action alongside the environment state—capturing HTML snapshots, screenshots, and input events—to produce complete multi-step trajectories for behavioral cloning or supervised learning.
Multi-step long-horizon task trajectory collection
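As a rough sketch (field names are illustrative, not the on-disk format), each step of a recorded trajectory pairs an action with the state observed before it:

from dataclasses import dataclass, field

@dataclass
class TrajectoryStep:
    screenshot_path: str   # pixels observed before the action
    html_snapshot: str     # HTML of the focused window before the action
    action: dict           # e.g. {"type": "click", "x": 412, "y": 96}

@dataclass
class Trajectory:
    task_id: str
    scenario: dict                        # injected scenario parameters
    steps: list[TrajectoryStep] = field(default_factory=list)
    success: bool = False                 # result of the @cb.evaluate_task check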
Trajectory Replotting
Record 1 human demonstration → re-render across 10 OS themes = 10 training trajectories.
Same actions, different visual presentations. This is how you build robust cross-platform training data at scale.
Trajectory traces from a Linux VM environment with low-level actions and HTML snapshots
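A hedged sketch of what replotting might look like in code; cb.load_demonstration and cb.replot are hypothetical names used only to illustrate the pattern.

import cuabench as cb  # assumed module name

THEMES = ["windows-11-light", "windows-11-dark", "windows-98",
          "macos-light", "macos-dark", "ubuntu-high-contrast",
          "android", "ios", "kde-plasma", "xfce"]

# One recorded human demonstration (file name is illustrative).
demo = cb.load_demonstration("create_playlist_demo.json")

# Re-render the same action sequence under each theme: 1 demo -> 10 trajectories.
trajectories = [cb.replot(demo, theme=theme) for theme in THEMES]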
Task Development Workflow
Users define tasks via four decorators:
- @tasks_config — scenario variations
- @setup_task — environment initialization
- @evaluate_task — success verification
- @solve_task — oracle solution
JSON-based scenario injection enables thousands of task variations from a single template.
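For instance, the scenarios from the playlist example above could be loaded from a JSON file rather than hard-coded. The file name and loading pattern here are illustrative, not a prescribed layout.

import json
import cuabench as cb  # assumed module name

# scenarios.json (illustrative contents):
#   [
#     {"playlist_name": "Workout Mix", "song": "Eye of the Tiger"},
#     {"playlist_name": "Chill Vibes", "song": "Weightless"}
#   ]

@cb.tasks_config
def config():
    # Load scenario variations from JSON instead of hard-coding them.
    with open("scenarios.json") as f:
        return {"scenarios": json.load(f)}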
Development workflow for generating tasks, running evals, and collecting trajectory data
Overview of Cua-Bench's task and environment architecture
3. Simulators & Environments
Beyond data generation, Cua-Bench provides full-fledged simulators for RL training.
Benchmark Adapters
Cua-Bench's Python API allows existing benchmarks to be wrapped as adapters:
- OSWorld
- Windows Agent Arena
- MiniWoB++
These adapters delegate to original setup/evaluation code while gaining access to Cua-Bench's variation and recording capabilities.
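A hedged sketch of what an adapter might look like; the cb.BenchmarkAdapter base class and method names are assumptions chosen to illustrate the delegation pattern, not the actual interface.

import cuabench as cb  # assumed module name

class OSWorldAdapter(cb.BenchmarkAdapter):  # hypothetical base class
    """Wraps an OSWorld task so it runs inside a Cua-Bench environment."""

    def __init__(self, osworld_task):
        self.task = osworld_task

    async def setup(self, env, scenario):
        # Delegate to OSWorld's original setup logic.
        await self.task.setup(env)

    async def evaluate(self, env, scenario):
        # Delegate to OSWorld's original success check, while Cua-Bench
        # adds visual variation and trajectory recording around it.
        return await self.task.evaluate(env)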
Shell Applications
Cua-Bench provides simulated shell applications with realistic GUI elements:
- Spotify clone — playlists, search, playback
- Slack clone — channels, messages, threads
- WhatsApp clone — chats, contacts, media
Shell applications with full functionality for agent interaction and testing
Each application is fully configurable:
- Appearance (style, color, theme)
- Placement (window position, size)
- Content (data, state)
Flexible per-application configuration for customizing content, UI elements, and more
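As a sketch of per-application configuration covering the three axes above (the constructor and parameter names are assumptions for illustration):

import cuabench as cb  # assumed module name

spotify = cb.shell_apps.spotify(
    theme="dark",                                                # appearance
    window={"x": 100, "y": 80, "width": 1280, "height": 800},    # placement
    state={"playlists": {"Workout Mix": ["Eye of the Tiger"]}},  # content
)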
Agents can explore, interact, and learn in realistic environments—without spinning up VMs.
Why This Matters
| Problem | Cua-Bench Solution |
|---|---|
| Static VM snapshots | Lightweight webtop environments (single CPU, no virtualization) |
| Fixed application sets | Randomized installed apps, favorited apps, window layouts |
| Limited task definitions | Declarative Python API with JSON scenario injection |
| Slow iteration (rebuild VM images) | Tasks decoupled from VM images |
| Training data lacks visual diversity | Generate thousands of visual variations automatically |
| No ground-truth trajectories | Oracle solutions with full HTML + screenshot + input event capture |
Getting Started
Full technical report and documentation: cuabench.ai
If you're working on computer-use agents—whether for research or production—we'd love to hear from you.
Citation
@article{cuabench2025,
  title={Cua-Bench: Technical Report},
  author={Cua AI Team},
  year={2025},
  url={https://cuabench.ai}
}
Links
- Technical Report: cuabench.ai
- Twitter/X: @trycua
If you're a research lab or researcher working on computer-use agents, reach out via DM or sign up on the website for early access.
