Feature-Oriented Agentic Coding Benchmark
End-to-end benchmarking of real-world feature development.
Tasks from real repositories, evaluated by executable tests.
Best %Resolved (Full): --
Best %Passed (Full): --
Latest Update: --
Collection Pipeline
FeatureBench builds feature-oriented tasks using an execution-based, test-driven pipeline. It selects fail-to-pass and pass-to-pass tests, traces function dependencies, extracts feature patches, and post-verifies each instance to ensure reproducible and continually refreshable evaluation.
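The sketch below illustrates, in rough Python, how such a pipeline could classify fail-to-pass and pass-to-pass tests and post-verify an instance. It is not the FeatureBench implementation; all names (`TaskInstance`, `classify_tests`, `post_verify`, and the injected `run_tests` callable) are hypothetical.

```python
# Hypothetical sketch of a test-driven collection pipeline.
# None of these helpers are the actual FeatureBench code; they only
# illustrate fail-to-pass / pass-to-pass selection and post-verification.

from dataclasses import dataclass

@dataclass
class TaskInstance:
    repo: str
    base_commit: str          # repository state before the feature
    feature_patch: str        # gold patch that implements the feature
    fail_to_pass: list[str]   # tests that fail before and pass after the patch
    pass_to_pass: list[str]   # regression tests that pass in both states

def classify_tests(results_before: dict[str, bool],
                   results_after: dict[str, bool]) -> tuple[list[str], list[str]]:
    """Split tests into fail-to-pass and pass-to-pass by comparing runs
    on the base commit and on the patched commit."""
    fail_to_pass = [t for t, ok in results_after.items()
                    if ok and not results_before.get(t, False)]
    pass_to_pass = [t for t, ok in results_after.items()
                    if ok and results_before.get(t, False)]
    return fail_to_pass, pass_to_pass

def post_verify(instance: TaskInstance, run_tests) -> bool:
    """Re-run both test groups in a fresh environment to confirm the
    instance is reproducible before it enters the benchmark."""
    before = run_tests(instance.repo, instance.base_commit, patch=None)
    after = run_tests(instance.repo, instance.base_commit,
                      patch=instance.feature_patch)
    return (all(not before[t] for t in instance.fail_to_pass)
            and all(after[t] for t in instance.fail_to_pass + instance.pass_to_pass))
```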
Task distribution across repositories (Full Set, 200 tasks).
Empirical Results
In the paper's baseline evaluation, Claude Opus 4.5 reaches 74.4% resolved on SWE-bench but only 11.0% on FeatureBench Full, while GPT-5.1-Codex reaches 12.5%. The gap suggests current agents are still far from reliably shipping real-world features.
- 200 Full tasks
- 30 Lite tasks
- 3825 Environments
- 24 Repositories
| Scaffold | Model | Lite %Passed | Lite %Resolved | Lite Token I/O | Full %Passed | Full %Resolved | Full Token I/O |
|---|---|---|---|---|---|---|---|
| OpenHands | Claude Opus 4.5 | 67.18 | 20.0 | 8.8M / 29k | 45.53 | 10.5 | 8.1M / 29k |
| Codex | GPT-5.1-Codex (medium reasoning) | 60.22 | 20.0 | 6.6M / 39k | 41.66 | 12.5 | 6.3M / 39k |
| Claude Code¹ | Claude Opus 4.5 | 59.12 | 20.0 | 9.0M / 35k | 43.29 | 11.0 | 7.5M / 34k |
| Gemini-CLI | Gemini-3-Pro-Preview (low reasoning) | 43.38 | 10.0 | 2.6M / 13k | 32.43 | 5.0 | 2.5M / 12k |
| OpenHands | Gemini-3-Pro-Preview (low reasoning) | 45.14 | 10.0 | 6.0M / 41k | 30.08 | 4.5 | 6.2M / 40k |
| OpenHands | DeepSeek-V3.2 | 35.94 | 6.7 | 3.1M / 24k | 26.30 | 5.5 | 3.1M / 23k |
| OpenHands | Qwen3-Coder-480B-A35B-Instruct | 38.31 | 6.7 | 2.6M / 16k | 24.55 | 3.5 | 2.0M / 14k |
¹ Claude Code's routing mode may dispatch some operations to other models (e.g., Claude Haiku), even when a specific model is selected.
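For reference, the snippet below shows one plausible way the two headline metrics could be derived from raw per-test outcomes. The exact definitions are given in the paper; this sketch assumes %Resolved counts a task only when every target test passes, while %Passed averages the per-task fraction of passing target tests.

```python
# Hedged sketch of the two headline metrics; the authoritative definitions
# are in the FeatureBench paper, and the task dict layout here is assumed.

def score(tasks: list[dict]) -> tuple[float, float]:
    """Each task dict maps "outcomes" to {test id: bool} for the agent's final patch."""
    resolved = 0
    passed_fracs = []
    for outcomes in (t["outcomes"] for t in tasks):
        passed_fracs.append(sum(outcomes.values()) / len(outcomes))
        if all(outcomes.values()):
            resolved += 1
    pct_passed = 100 * sum(passed_fracs) / len(tasks)
    pct_resolved = 100 * resolved / len(tasks)
    return pct_passed, pct_resolved
```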
Complexity & Time Analysis
We analyze two empirical slices: pass rate versus pending implementation lines, and pass rate versus task creation time.
- Pass rate vs. pending implementation lines: bins reflect ranges of pending implementation lines.
- Pass rate vs. task creation time: quarterly cohorts from pre-2023 through 2025 Q4.
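The sketch below shows how these two slices could be produced from a per-task results table. The column names (`pending_lines`, `created_at`, `pass_rate`) and the bin edges are assumptions for illustration, not the paper's actual data schema.

```python
# Illustrative slicing of per-task results (not the paper's analysis code).
import pandas as pd

def slice_results(df: pd.DataFrame) -> tuple[pd.Series, pd.Series]:
    """df is assumed to have columns: 'pending_lines', 'created_at',
    and 'pass_rate' (per-task fraction of target tests passed)."""
    # Pass rate binned by pending implementation lines.
    line_bins = pd.cut(df["pending_lines"],
                       bins=[0, 50, 100, 200, 500, float("inf")])
    by_lines = df.groupby(line_bins, observed=True)["pass_rate"].mean()

    # Pass rate by quarterly creation cohort.
    quarters = pd.PeriodIndex(pd.to_datetime(df["created_at"]), freq="Q")
    by_quarter = df.groupby(quarters)["pass_rate"].mean()
    return by_lines, by_quarter
```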
Citation
Use the following BibTeX entry when citing FeatureBench.
@misc{zhou2026featurebenchbenchmarkingagenticcoding,
  title={FeatureBench: Benchmarking Agentic Coding for Complex Feature Development},
  author={Qixing Zhou and Jiacheng Zhang and Haiyang Wang and Rui Hao and Jiahe Wang and Minghao Han and Yuxue Yang and Shuzhe Wu and Feiyang Pan and Lue Fan and Dandan Tu and Zhaoxiang Zhang},
  year={2026},
  eprint={2602.10975},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2602.10975}
}