Feature-Oriented Agentic Coding Benchmark

FeatureBench: Beyond bug fixing. Ship real features.

End-to-end benchmarking of real-world feature development.

Tasks from real repositories, evaluated by executable tests.


Collection Pipeline

Scalable, test-driven instance construction from repositories

FeatureBench builds feature-oriented tasks using an execution-based, test-driven pipeline. It selects fail-to-pass and pass-to-pass tests, traces function dependencies, extracts feature patches, and post-verifies each instance to ensure reproducible and continually refreshable evaluation.

  • Dependency graphs are built via dynamic tracing instead of relying solely on static heuristics.
  • Post-verification enforces fail-to-pass and pass-to-pass conditions before patch replay (see the sketch after this list).
  • Problem statements are synthesized with explicit interfaces and import paths.
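
To make the fail-to-pass / pass-to-pass contract concrete, below is a minimal Python sketch of what a post-verification gate checks. The helper names (run_tests, post_verify), the pytest invocation, and the exact before/after semantics are illustrative assumptions, not the pipeline's actual implementation.

import subprocess

def run_tests(test_ids, cwd):
    """Run the given pytest node IDs; return True iff they all pass.
    Hypothetical helper; the real harness is not shown here."""
    result = subprocess.run(["python", "-m", "pytest", *test_ids], cwd=cwd)
    return result.returncode == 0

def post_verify(repo_dir, patch_file, fail_to_pass, pass_to_pass):
    """Enforce the fail-to-pass / pass-to-pass contract around patch replay."""
    # Before the feature patch: every fail-to-pass test must fail on its own,
    # and all pass-to-pass tests must already pass.
    if any(run_tests([t], repo_dir) for t in fail_to_pass):
        return False  # feature already present, so the instance is invalid
    if not run_tests(pass_to_pass, repo_dir):
        return False  # baseline is broken, so the instance is invalid
    # Replay the extracted feature patch, then require both sets to pass.
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
    return run_tests(fail_to_pass, repo_dir) and run_tests(pass_to_pass, repo_dir)

Under this contract, an instance is kept only when the checks hold on both sides of the replay, which is what makes each task reproducible and mechanically verifiable.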

[Chart] Dataset Composition by Repository (Full Split): the FeatureBench Full Set comprises 200 tasks.

Empirical Results

Frontier agents still face a large feature-development gap

In the paper's baseline, Claude Opus 4.5 reaches 74.4% resolved on SWE-bench but only 11.0% on FeatureBench Full, while GPT-5.1-Codex reaches 12.5% on FeatureBench Full. This gap suggests current agents are still far from reliably shipping real-world features.

  • 200 Full tasks
  • 30 Lite tasks
  • 3,825 environments
  • 24 repositories

| Scaffold | Model | %Passed (Lite) | %Resolved (Lite) | Token I/O (Lite) | %Passed (Full) | %Resolved (Full) | Token I/O (Full) |
|---|---|---|---|---|---|---|---|
| OpenHands | Claude Opus 4.5 | 67.18 | 20.0 | 8.8M / 29k | 45.53 | 10.5 | 8.1M / 29k |
| Codex | GPT-5.1-Codex (medium reasoning) | 60.22 | 20.0 | 6.6M / 39k | 41.66 | 12.5 | 6.3M / 39k |
| Claude Code¹ | Claude Opus 4.5 | 59.12 | 20.0 | 9.0M / 35k | 43.29 | 11.0 | 7.5M / 34k |
| Gemini-CLI | Gemini-3-Pro-Preview (low reasoning) | 43.38 | 10.0 | 2.6M / 13k | 32.43 | 5.0 | 2.5M / 12k |
| OpenHands | Gemini-3-Pro-Preview (low reasoning) | 45.14 | 10.0 | 6.0M / 41k | 30.08 | 4.5 | 6.2M / 40k |
| OpenHands | DeepSeek-V3.2 | 35.94 | 6.7 | 3.1M / 24k | 26.30 | 5.5 | 3.1M / 23k |
| OpenHands | Qwen3-Coder-480B-A35B-Instruct | 38.31 | 6.7 | 2.6M / 16k | 24.55 | 3.5 | 2.0M / 14k |

¹ Claude Code's routing mode may dispatch some operations to other models (e.g., Claude Haiku), even when a specific model is selected.

Complexity & Time Analysis

How pass rates change across code-complexity bands and task-creation periods

We analyze two empirical slices: pass rate versus pending implementation lines, and pass rate versus task creation time.

Pass Rate vs Lines

Bins reflect ranges of pending implementation lines.

Pass Rate vs Create Time

Quarterly cohorts from pre-2023 through 2025 Q4.
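
As a rough illustration of how these two slices can be computed from per-task results, here is a small pandas sketch; the column names, toy values, and bin edges are assumptions for illustration, not the paper's exact complexity bands.

import pandas as pd

# Toy per-task results; the schema here is hypothetical.
df = pd.DataFrame({
    "pending_lines": [30, 120, 80, 500, 250, 40],       # lines left to implement
    "created": pd.to_datetime(["2022-11-01", "2023-03-15", "2024-01-10",
                               "2024-07-02", "2025-02-20", "2025-10-05"]),
    "passed": [True, True, False, False, False, True],  # did all tests pass?
})

# Slice 1: pass rate across pending-implementation-line bands.
bands = pd.cut(df["pending_lines"], bins=[0, 50, 150, 400, float("inf")])
print(df.groupby(bands, observed=True)["passed"].mean())

# Slice 2: pass rate across quarterly creation cohorts.
print(df.groupby(df["created"].dt.to_period("Q"))["passed"].mean())

The first slice shows how pass rates move as the amount of pending implementation grows; the second groups tasks by creation quarter so pass rates on older and newer code can be compared.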

Citation

Use one of the following formats when citing FeatureBench.

BibTeX

@misc{zhou2026featurebenchbenchmarkingagenticcoding,
  title={FeatureBench: Benchmarking Agentic Coding for Complex Feature Development},
  author={Qixing Zhou and Jiacheng Zhang and Haiyang Wang and Rui Hao and Jiahe Wang and Minghao Han and Yuxue Yang and Shuzhe Wu and Feiyang Pan and Lue Fan and Dandan Tu and Zhaoxiang Zhang},
  year={2026},
  eprint={2602.10975},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2602.10975}
}

APA

Zhou, Q., Zhang, J., Wang, H., Hao, R., Wang, J., Han, M., Yang, Y., Wu, S., Pan, F., Fan, L., Tu, D., & Zhang, Z. (2026). FeatureBench: Benchmarking agentic coding for complex feature development. arXiv. https://arxiv.org/abs/2602.10975

MLA

Zhou, Qixing, et al. "FeatureBench: Benchmarking Agentic Coding for Complex Feature Development." arXiv, 2026, arxiv.org/abs/2602.10975.