Claw‑Anything places agents in a context‑rich, noisy digital life — months of activity, dozens of interdependent services, and coordinated CLI + GUI action across devices. Even the strongest frontier model passes only 34.5% of tasks on the first try.
Open- and closed-source models are evaluated under a unified OpenHarness scaffold for a fair, apples-to-apples comparison. ★ marks the best result in each column.
Existing benchmarks expose only a narrow slice of the user's world. Claw‑Anything widens the agent's perceptual scope along three axes — then injects realistic noise, conflicts, and distractions on top.
Temporal reasoning. Three+ months of fine-grained, system- and service-level activity logs connect past and present. Agents must mine task-relevant signals from 83.7k words of history.
Cross-service coordination. 35 backend services span work, lifestyle and social spaces; an average task touches 10.1 of them. Masking cross-service tools collapses success to near zero.
Heterogeneous interfaces. Tasks coordinate Linux CLI containers and Android GUI emulators. 50 of 200 eval tasks demand joint CLI+GUI action, by far the hardest split.
Reactive + proactive. Beyond explicit requests, Claw‑Anything adopts a heartbeat mechanism: the agent periodically monitors the user's environment and must surface timely, contextually grounded recommendations without being prompted, a test of genuine always-on assistance.
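The heartbeat idea can be illustrated with a minimal sketch. It assumes a simulated clock and a hypothetical `propose` callback standing in for the agent; the names `Event`, `run_heartbeat`, and `propose` are illustrative and not taken from the benchmark's code. On each tick the agent sees only the events that arrived since the previous tick and may return a recommendation or stay silent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Event:
    """One entry in the user's activity stream (hypothetical schema)."""
    timestamp: float
    service: str
    text: str


def run_heartbeat(
    events: List[Event],
    propose: Callable[[float, List[Event]], Optional[str]],
    interval: float,
    start: float,
    end: float,
) -> List[Tuple[float, str]]:
    """Tick every `interval` time units and collect unprompted recommendations.

    At each tick, `propose` receives the events that arrived since the
    previous tick. Returning None means the agent chose to stay silent.
    """
    recommendations: List[Tuple[float, str]] = []
    last_tick = start
    tick = start + interval
    while tick <= end:
        window = [e for e in events if last_tick < e.timestamp <= tick]
        suggestion = propose(tick, window)
        if suggestion is not None:
            recommendations.append((tick, suggestion))
        last_tick = tick
        tick += interval
    return recommendations


def simple_agent(now: float, window: List[Event]) -> Optional[str]:
    """Toy stand-in for the agent: speak only when something new happened."""
    if not window:
        return None
    return "Check " + ", ".join(e.service for e in window)


demo_events = [
    Event(5.0, "calendar", "meeting moved to 3pm"),
    Event(25.0, "email", "flight delayed"),
]
demo_recs = run_heartbeat(demo_events, simple_agent, interval=10.0, start=0.0, end=30.0)
```

In this toy run the agent speaks at the first and third ticks and stays silent at the second, when nothing new has arrived. A real evaluation would additionally need to judge whether each recommendation is timely and grounded, which this sketch does not attempt.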
Bringing the agent's perceptual scope closer to the user's makes the benchmark markedly harder — success now requires both accurate understanding of the environment and correct grounded action.
The strongest frontier model still fails roughly two-thirds of tasks on the first attempt — far below scores on prior, narrower benchmarks.
When success is required in all three independent trials, no model clears 20%. Reliability under broad context remains the core challenge.
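The difference between first-try success and all-trials success can be made concrete with a small sketch. It assumes each task is run for a fixed number of independent trials and that Pass@1 is estimated as the mean per-trial success rate; the benchmark's exact estimator is not stated here, so both functions and the sample data are illustrative only.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Estimate first-try success as the mean per-trial pass rate over tasks."""
    per_task = [sum(trials) / len(trials) for trials in results.values()]
    return sum(per_task) / len(per_task)


def pass_all_trials(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks that pass in every one of their trials."""
    solved = sum(1 for trials in results.values() if all(trials))
    return solved / len(results)


# Made-up outcomes for four tasks, three trials each (not benchmark data).
sample = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [False, False, False],
    "task_d": [False, True, False],
}

first_try = pass_at_1(sample)
all_three = pass_all_trials(sample)
```

On this made-up sample, half of all trials succeed, yet only one task in four passes every trial. The stricter metric penalizes inconsistency: a task that passes two trials out of three contributes to the first metric but counts as a failure under the second.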
Post-training Qwen3.5-27B on 1,500 trajectories from our auto-generated environments lifts Pass@1 by 23.7 points, rivalling closed-source frontier models.
The investigation–execution gap. Across every model, the dominant failure mode is the same: agents locate the relevant context yet fail to translate that understanding into successful action. Execution — not perception — is the primary bottleneck in broad, always-on digital environments.