Feature-Oriented Agentic Coding Benchmark
End-to-end benchmarking of real-world feature development.
Tasks from real repositories, evaluated by executable tests.
Best %Resolved (Full): --
Best %Passed (Full): --
Latest Update: --
Collection Pipeline
FeatureBench builds feature-oriented tasks using an execution-based, test-driven pipeline. It selects fail-to-pass and pass-to-pass tests, traces function dependencies, extracts feature patches, and post-verifies each instance to ensure reproducible and continually refreshable evaluation.
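The sketch below illustrates, in rough Python, how such a pipeline could classify fail-to-pass and pass-to-pass tests and post-verify an instance. It is not the FeatureBench implementation; all names (`TaskInstance`, `classify_tests`, `post_verify`, and the injected `run_tests` callable) are hypothetical.

```python
# Hypothetical sketch of a test-driven collection pipeline.
# None of these helpers are the actual FeatureBench code; they only
# illustrate fail-to-pass / pass-to-pass selection and post-verification.

from dataclasses import dataclass

@dataclass
class TaskInstance:
    repo: str
    base_commit: str          # repository state before the feature
    feature_patch: str        # gold patch that implements the feature
    fail_to_pass: list[str]   # tests that fail before and pass after the patch
    pass_to_pass: list[str]   # regression tests that pass in both states

def classify_tests(results_before: dict[str, bool],
                   results_after: dict[str, bool]) -> tuple[list[str], list[str]]:
    """Split tests into fail-to-pass and pass-to-pass by comparing runs
    on the base commit and on the patched commit."""
    fail_to_pass = [t for t, ok in results_after.items()
                    if ok and not results_before.get(t, False)]
    pass_to_pass = [t for t, ok in results_after.items()
                    if ok and results_before.get(t, False)]
    return fail_to_pass, pass_to_pass

def post_verify(instance: TaskInstance, run_tests) -> bool:
    """Re-run both test groups in a fresh environment to confirm the
    instance is reproducible before it enters the benchmark."""
    before = run_tests(instance.repo, instance.base_commit, patch=None)
    after = run_tests(instance.repo, instance.base_commit,
                      patch=instance.feature_patch)
    return (all(not before[t] for t in instance.fail_to_pass)
            and all(after[t] for t in instance.fail_to_pass + instance.pass_to_pass))
```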
Task distribution across repositories (Full Set, 200 tasks).
Empirical Results
In the paper's baseline evaluation, Claude Opus 4.5 reaches 74.4% resolved on SWE-bench but only 11.0% on FeatureBench Full, while GPT-5.1-Codex reaches 12.5%. The gap suggests current agents are still far from reliably shipping real-world features.
- 200 Full tasks
- 30 Lite tasks
- 3825 Environments
- 24 Repositories
| Scaffold | Model | Lite %Passed | Lite %Resolved | Lite Token I/O | Full %Passed | Full %Resolved | Full Token I/O |
|---|---|---|---|---|---|---|---|
| OpenHands | Claude Opus 4.5 | 67.18 | 20.0 | 8.8M / 29k | 45.53 | 10.5 | 8.1M / 29k |
| Codex | GPT-5.1-Codex (medium reasoning) | 60.22 | 20.0 | 6.6M / 39k | 41.66 | 12.5 | 6.3M / 39k |
| Claude Code¹ | Claude Opus 4.5 | 59.12 | 20.0 | 9.0M / 35k | 43.29 | 11.0 | 7.5M / 34k |
| Gemini-CLI | Gemini-3-Pro-Preview (low reasoning) | 43.38 | 10.0 | 2.6M / 13k | 32.43 | 5.0 | 2.5M / 12k |
| OpenHands | Gemini-3-Pro-Preview (low reasoning) | 45.14 | 10.0 | 6.0M / 41k | 30.08 | 4.5 | 6.2M / 40k |
| OpenHands | DeepSeek-V3.2 | 35.94 | 6.7 | 3.1M / 24k | 26.30 | 5.5 | 3.1M / 23k |
| OpenHands | Qwen3-Coder-480B-A35B-Instruct | 38.31 | 6.7 | 2.6M / 16k | 24.55 | 3.5 | 2.0M / 14k |
¹ Claude Code's routing mode may dispatch some operations to other models (e.g., Claude Haiku), even when a specific model is selected.
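For reference, the snippet below shows one plausible way the two headline metrics could be derived from raw per-test outcomes. The exact definitions are given in the paper; this sketch assumes %Resolved counts a task only when every target test passes, while %Passed averages the per-task fraction of passing target tests.

```python
# Hedged sketch of the two headline metrics; the authoritative definitions
# are in the FeatureBench paper, and the task dict layout here is assumed.

def score(tasks: list[dict]) -> tuple[float, float]:
    """Each task dict maps "outcomes" to {test id: bool} for the agent's final patch."""
    resolved = 0
    passed_fracs = []
    for outcomes in (t["outcomes"] for t in tasks):
        passed_fracs.append(sum(outcomes.values()) / len(outcomes))
        if all(outcomes.values()):
            resolved += 1
    pct_passed = 100 * sum(passed_fracs) / len(tasks)
    pct_resolved = 100 * resolved / len(tasks)
    return pct_passed, pct_resolved
```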
Complexity & Time Analysis
We analyze two empirical slices: pass rate versus pending implementation lines, and pass rate versus task creation time.
- Pass rate vs. pending implementation lines: bins reflect ranges of pending implementation lines.
- Pass rate vs. task creation time: quarterly cohorts from pre-2023 through 2025 Q4.
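The sketch below shows how these two slices could be produced from a per-task results table. The column names (`pending_lines`, `created_at`, `pass_rate`) and the bin edges are assumptions for illustration, not the paper's actual data schema.

```python
# Illustrative slicing of per-task results (not the paper's analysis code).
import pandas as pd

def slice_results(df: pd.DataFrame) -> tuple[pd.Series, pd.Series]:
    """df is assumed to have columns: 'pending_lines', 'created_at',
    and 'pass_rate' (per-task fraction of target tests passed)."""
    # Pass rate binned by pending implementation lines.
    line_bins = pd.cut(df["pending_lines"],
                       bins=[0, 50, 100, 200, 500, float("inf")])
    by_lines = df.groupby(line_bins, observed=True)["pass_rate"].mean()

    # Pass rate by quarterly creation cohort.
    quarters = pd.PeriodIndex(pd.to_datetime(df["created_at"]), freq="Q")
    by_quarter = df.groupby(quarters)["pass_rate"].mean()
    return by_lines, by_quarter
```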
Citation
Use the following BibTeX entry when citing FeatureBench.
@misc{zhou2026featurebenchbenchmarkingagenticcoding,
  title={FeatureBench: Benchmarking Agentic Coding for Complex Feature Development},
  author={Qixing Zhou and Jiacheng Zhang and Haiyang Wang and Rui Hao and Jiahe Wang and Minghao Han and Yuxue Yang and Shuzhe Wu and Feiyang Pan and Lue Fan and Dandan Tu and Zhaoxiang Zhang},
  year={2026},
  eprint={2602.10975},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2602.10975}
}