What popular coding benchmarks actually measure
Recent frontier-model reports often headline high scores on coding benchmarks. A closer look at several widely used suites—SWE-bench Verified and Pro, Aider Polyglot, and LiveCodeBench—shows that these benchmarks measure a narrow slice of developer work: small, well-scoped units verified by automated tests. The distinction matters because real-world software engineering is often messier, involving ambiguous requirements, architecture trade-offs, security, and maintainability.
SWE-bench Verified and SWE-bench Pro
What it measures
Whether an agent, given a real-world GitHub issue, can produce a patch that passes the unit tests associated with that issue.
The specifics
- SWE-bench Verified: 500 Python problems, heavily skewed toward a handful of open-source libraries (over 40% from Django). Gold solutions are typically tiny (mean ≈ 11 LOC, median ≈ 4; a rough way to check these figures is sketched after this list), and most edits touch a single function. All issues predate 2024, which increases the risk of training-set contamination. Dataset: SWE-bench dataset.
- SWE-bench Pro (Scale AI): expands to 1,865 problems across multiple languages, with mean solution size roughly 107 LOC and median 55 LOC, often spanning multiple files. Problems were human-rewritten from issues/PRs and include dockerized environments so dependency setup is out of scope. Relevant materials: Scale AI blog, paper, leaderboard.
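As a rough sanity check on those patch-size figures, the sketch below loads the Verified split from Hugging Face and counts changed lines in each gold patch. The dataset id (`princeton-nlp/SWE-bench_Verified`) and the `patch` field name are assumptions based on the public dataset layout, and counting added plus removed diff lines is only a crude proxy for "solution LOC".

```python
# Rough check of gold-patch sizes on SWE-bench Verified.
# Assumes the Hugging Face dataset id "princeton-nlp/SWE-bench_Verified"
# and a "patch" field holding the gold diff; adjust if the layout differs.
from statistics import mean, median

from datasets import load_dataset


def changed_lines(diff: str) -> int:
    """Count added/removed lines in a unified diff, ignoring file headers."""
    return sum(
        1
        for line in diff.splitlines()
        if (line.startswith("+") or line.startswith("-"))
        and not line.startswith(("+++", "---"))
    )


ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
sizes = [changed_lines(row["patch"]) for row in ds]
print(f"{len(sizes)} instances, mean {mean(sizes):.1f}, median {median(sizes)} changed lines")
```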
Verdict
SWE-bench offers a useful signal for progress on well-defined units of work. It does not measure maintainability, security, or broader product and design skills. Contamination and the reliance on unit tests remain important caveats (see related analyses such as the UTBoost paper: https://arxiv.org/abs/2506.09289).
Aider Polyglot
What it measures
Whether a model, driving the Aider editing harness, can solve Exercism exercises across several languages, producing edits that pass the tests after at most one round of test-failure feedback.
The specifics
- Focuses on Exercism “kata-style” exercises rather than algorithmic puzzles.
- Polyglot coverage includes C++, Go, Java, JavaScript, Python, and Rust (225 problems total).
- Typical solutions run from roughly 30–200 LOC, often within one or two files.
- Evaluation is unit-test driven via the Aider harness; a minimal sketch of the retry loop follows this list. Details: Aider blog post.
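The "at most one feedback round" detail is easy to gloss over: the model sees the failing test output once and gets a single chance to fix its edits. A minimal sketch of that loop, with hypothetical `ask_model` and `run_tests` helpers standing in for the real Aider harness:

```python
# Minimal sketch of the Aider Polyglot retry loop (not the real harness).
# `ask_model` and `run_tests` are hypothetical stand-ins: the first returns
# edited source files, the second runs the exercise's test suite.
from typing import Callable


def solve_exercise(
    prompt: str,
    ask_model: Callable[[str], dict[str, str]],
    run_tests: Callable[[dict[str, str]], tuple[bool, str]],
) -> bool:
    files = ask_model(prompt)                      # first attempt
    passed, test_output = run_tests(files)
    if passed:
        return True
    # One feedback round: show the failing test output, allow one more edit.
    files = ask_model(f"{prompt}\n\nTests failed:\n{test_output}")
    passed, _ = run_tests(files)
    return passed
```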
Verdict
Aider Polyglot assesses cross-language competence on contained, well-specified tasks. It’s not representative of full SWE responsibilities.
LiveCodeBench
What it measures
Python competitive-programming skills under hidden test suites (LeetCode-style).
The specifics
- Tasks include generating solutions, repairing incorrect solutions, and auxiliary variants such as predicting a function's output for a given input.
- Balanced mix of easy/medium/hard problems; evaluation runs candidate code against hidden test suites (a simplified judge is sketched after this list).
- Efforts were made to avoid contamination by using problems released after model cutoffs. Project: http://livecodebench.github.io/
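For the generation tasks, grading essentially means running the candidate program against held-out input/output pairs. A simplified stdin/stdout judge along those lines (not LiveCodeBench's actual harness, which also sandboxes execution and enforces resource limits) could look like:

```python
# Simplified stdin/stdout judge for LeetCode/Codeforces-style problems.
# Not the actual LiveCodeBench harness; real judges also sandbox the code,
# enforce time/memory limits, and normalise things like float formatting.
import subprocess


def passes_hidden_tests(solution_path: str, cases: list[tuple[str, str]]) -> bool:
    for stdin_text, expected_stdout in cases:
        result = subprocess.run(
            ["python", solution_path],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=10,
        )
        if result.returncode != 0 or result.stdout.strip() != expected_stdout.strip():
            return False
    return True
```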
Verdict
LiveCodeBench is a reliable proxy for LeetCode-like performance in Python, not for general SWE work.
Other benchmarks
- TerminalBench: focuses on terminal usage (tbench.ai, docs at https://docs.tbench.ai).
- SWE-Lancer (OpenAI): maps tasks to economic value and uses E2E tests; limited public reporting recently: https://openai.com/research/swe-lancer.
- METR Long-Horizon Benchmark: evaluates time-horizon and messiness for autonomous LLM tasks (https://metr.org/blog/2025-08-12-research-update-towards-reconciling-slowdown-with-time-horizons/).
- Polyglot SWE-bench variants: Multi-SWE-bench (ByteDance) and SWE-bench Multilingual expand language coverage (Multi-SWE-bench GitHub, SWE-bench Multilingual).
- HumanEval remains cited but covers trivial toy problems and is less relevant today.
Why benchmarking is hard — and why that’s encouraging for coding agents
Designing high-quality benchmarks requires substantial human effort. Automated verification scales easily, so many suites default to unit-test pass rates. That approach is pragmatic but limited: unit tests often miss deeper correctness issues, and they say nothing about requirements elicitation, architecture decisions, security, maintainability, or the long-term trade-offs that are central to software engineering.
Several promising directions for richer evaluation have emerged:
- Use generative testing (property-based testing, fuzzing) instead of, or alongside, unit tests; a small example follows this list.
- Apply formal methods when feasible.
- Validate against automated UATs and end-to-end tests.
- Start benchmarks from product-level inputs (PRDs, business context).
- Create setups that require agents to acquire missing information or clarify requirements.
- Use well-calibrated human judges for fuzzier quality criteria.
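To make the first point concrete, here is what a property-based test looks like with the Hypothesis library: invariants are asserted over many generated inputs rather than a handful of hand-picked cases. The `dedupe_preserve_order` function under test is made up for illustration.

```python
# Property-based test with Hypothesis: instead of a few hand-written cases,
# assert invariants over hundreds of generated inputs.
# `dedupe_preserve_order` is a made-up function under test, for illustration only.
from hypothesis import given, strategies as st


def dedupe_preserve_order(items: list[int]) -> list[int]:
    seen: set[int] = set()
    return [x for x in items if not (x in seen or seen.add(x))]


@given(st.lists(st.integers()))
def test_dedupe_properties(items: list[int]) -> None:
    result = dedupe_preserve_order(items)
    assert set(result) == set(items)          # no elements lost or invented
    assert len(result) == len(set(items))     # no duplicates remain
    # Order of first occurrences is preserved.
    assert result == sorted(result, key=items.index)
```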
Because current benchmarks leave so much low-hanging fruit, progress on coding agents still has considerable runway. Improved benchmarks and RL environments that capture the messiness of real engineering work could reveal considerably more practical capability than unit-test pass rates alone suggest.
Original source: https://blog.nilenso.com/blog/2025/09/25/swe-benchmarks/