✦ Benchmarking Personal Assistant Agents

Benchmarking the always‑on assistant across the user's digital world

Claw‑Anything places agents in a context‑rich, noisy digital life — months of activity, dozens of interdependent services, and coordinated CLI + GUI action across devices. Even the strongest frontier model passes only 34.5% of tasks on the first try.

200
Eval Tasks
35
Backend Services
3+ months
Activity Horizon
10.1
Services / Task
Main Results

Frontier model leaderboard

Open- and closed-source models are evaluated under a unified OpenHarness scaffold for a fair, apples-to-apples comparison.

The Benchmark

Three dimensions of expanded context

Existing benchmarks expose only a narrow slice of the user's world. Claw‑Anything widens the agent's perceptual scope along three axes — then injects realistic noise, conflicts, and distractions on top.

🕒

Long-horizon event streams

At least three months of fine-grained, system- and service-level activity logs connect past and present. Agents must mine task-relevant signals from 83.7k words of history.

temporal reasoning
🧩

Interdependent services

35 backend services span work, lifestyle and social spaces; an average task touches 10.1 of them. Masking cross-service tools collapses success to near zero.

cross-service coordination
📱

Multi-device CLI + GUI

Tasks coordinate Linux CLI containers and Android GUI emulators. 50 of 200 eval tasks demand joint CLI+GUI action — by far the hardest split.

heterogeneous interfaces

Reactive + proactive. Beyond explicit requests, Claw‑Anything adopts a heartbeat mechanism: the agent periodically monitors the user's environment and must surface timely, contextually grounded recommendations without being prompted — testing genuine always-on assistance.
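
As a rough illustration of the heartbeat idea described above, the sketch below wakes an agent at a fixed interval, hands it the events that arrived since the last tick, and records any recommendation it chooses to surface. This is not the benchmark's actual implementation: `Event`, `HeartbeatAgent`, the `policy` callback, and the 60-second interval are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Event:
    """One entry in a simulated activity stream (hypothetical schema)."""
    timestamp: float  # seconds since the start of the stream
    service: str      # e.g. "calendar", "mail"
    text: str


@dataclass
class HeartbeatAgent:
    # Hypothetical policy: returns a recommendation, or None to stay silent.
    policy: Callable[[List[Event]], Optional[str]]
    interval: float = 60.0  # assumed heartbeat period in seconds
    surfaced: List[str] = field(default_factory=list)

    def run(self, events: List[Event], end_time: float) -> List[str]:
        """Wake every `interval` seconds and inspect events since the last tick."""
        ordered = sorted(events, key=lambda e: e.timestamp)
        cursor = 0
        tick = self.interval
        while tick <= end_time:
            # Collect everything that happened up to and including this tick.
            new_events: List[Event] = []
            while cursor < len(ordered) and ordered[cursor].timestamp <= tick:
                new_events.append(ordered[cursor])
                cursor += 1
            suggestion = self.policy(new_events)
            if suggestion is not None:
                self.surfaced.append(suggestion)
            tick += self.interval
        return self.surfaced


def calendar_policy(new_events: List[Event]) -> Optional[str]:
    """Toy policy: speak up only when a calendar event appears."""
    for event in new_events:
        if event.service == "calendar":
            return f"Heads-up: {event.text}"
    return None


stream = [
    Event(10.0, "calendar", "meeting moved"),
    Event(130.0, "mail", "newsletter"),
]
agent = HeartbeatAgent(policy=calendar_policy)
recommendations = agent.run(stream, end_time=180.0)
```

In this toy run the agent wakes three times and speaks only once, on the tick that sees the calendar change. Staying silent on the other ticks is part of the behaviour being tested, since unprompted recommendations are only useful when they are timely and grounded.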

Key Findings

A pronounced capability gap

Bringing the agent's perceptual scope closer to the user's makes the benchmark markedly harder — success now requires both accurate understanding of the environment and correct grounded action.

34.5%

Best Pass@1 (GPT-5.5)

The strongest frontier model still fails roughly two-thirds of tasks on the first attempt — far below scores on prior, narrower benchmarks.

20.0%

Pass^3 ceiling

Requiring success across all three independent trials, no model clears 20%. Reliability under broad context remains the core challenge.

+23.7 points

Gain from our pipeline

Post-training Qwen3.5-27B on 1,500 trajectories from our auto-generated environments lifts Pass@1 by 23.7 points, rivalling closed-source frontier models.
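
To make the difference between the Pass@1 and Pass^3 figures above concrete, here is a minimal sketch of how the two could be computed from per-task trial outcomes. It assumes Pass@1 counts only the first trial ("on the first try") and Pass^3 requires all three trials to succeed, as the text describes; the function names and the toy data are hypothetical, and the paper's exact aggregation may differ.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks solved on the first trial (assumed definition)."""
    return sum(trials[0] for trials in results.values()) / len(results)


def pass_hat_k(results: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks solved in every one of the first k trials."""
    for task, trials in results.items():
        if len(trials) < k:
            raise ValueError(f"task {task!r} has fewer than {k} trials")
    return sum(all(trials[:k]) for trials in results.values()) / len(results)


# Toy outcomes for four hypothetical tasks, three trials each.
toy_results = {
    "task_a": [True, True, True],     # reliable success
    "task_b": [True, False, True],    # first-try success, but flaky
    "task_c": [False, True, True],    # fails first, succeeds later
    "task_d": [False, False, False],  # never succeeds
}

p1 = pass_at_1(toy_results)
p3 = pass_hat_k(toy_results, k=3)
```

On this toy data Pass@1 is 0.5 while Pass^3 is 0.25: the flaky task counts toward the first metric but not the second, which is why Pass^3 reads as a reliability measure rather than a capability measure.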

🔎

The investigation–execution gap. Across every model, the dominant failure mode is the same: agents locate the relevant context yet fail to translate that understanding into successful action. Execution — not perception — is the primary bottleneck in broad, always-on digital environments.