What Coding Benchmarks Really Measure: Limits of SWE-Bench and More

Benchmarks like SWE-bench, Aider Polyglot, and LiveCodeBench mainly test small, unit-test-verified tasks, not architecture, security, or maintainability. The author of the source post urges richer checks such as fuzzing, UATs, and human review.


TL;DR

What popular coding benchmarks actually measure

Recent frontier-model reports often headline high scores on coding benchmarks. A closer look at several widely used suites—SWE-bench Verified and Pro, Aider Polyglot, and LiveCodeBench—shows that these benchmarks measure a narrow slice of developer work: small, well-scoped units verified by automated tests. The distinction matters because real-world software engineering is often messier, involving ambiguous requirements, architecture trade-offs, security, and maintainability.

SWE-bench Verified and SWE-bench Pro

What it measures

Whether an agent can produce a patch for a real-world GitHub issue that passes the unit tests associated with that issue.

The specifics

  • SWE-bench Verified: 500 Python problems, heavily skewed toward open-source libraries (over 40% from Django). Solutions are typically tiny—mean LOC ≈ 11, median ≈ 4—and most edits touch a single function (see the sketch after this list). All issues predate 2024, increasing risk of training-set contamination. Dataset: SWE-bench dataset.
  • SWE-bench Pro (Scale AI): expands to 1,865 problems across multiple languages, with mean solution size roughly 107 LOC and median 55 LOC, often spanning multiple files. Problems were human-rewritten from issues/PRs and include dockerized environments so dependency setup is out of scope. Relevant materials: Scale AI blog, paper, leaderboard.
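
To make these task sizes concrete, here is a minimal sketch that loads the benchmark and measures how many lines each gold patch changes. It assumes the Hugging Face datasets library and the public princeton-nlp/SWE-bench_Verified dataset; the "patch" field name is taken from the dataset card and may differ or change, so treat this as illustrative rather than an official harness.

```python
# Sketch: inspect gold-patch sizes in SWE-bench Verified.
# Assumes the Hugging Face `datasets` library is installed and that the
# dataset exposes a unified diff under the "patch" field.
from statistics import mean, median

from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")


def changed_lines(diff: str) -> int:
    """Count added/removed lines in a unified diff, ignoring file headers."""
    return sum(
        1
        for line in diff.splitlines()
        if (line.startswith("+") or line.startswith("-"))
        and not line.startswith(("+++", "---"))
    )


sizes = [changed_lines(row["patch"]) for row in ds]
print(f"problems: {len(sizes)}")
print(f"mean changed LOC: {mean(sizes):.1f}, median: {median(sizes)}")
```

Running something like this is also a quick way to sanity-check the "mean ≈ 11, median ≈ 4" figures quoted above against the current version of the dataset.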

Verdict

SWE-bench offers a useful signal for progress on well-defined units of work. It does not measure maintainability, security, or broader product and design skills. Contamination and the reliance on unit tests remain important caveats (see related analyses such as the UTBoost paper: https://arxiv.org/abs/2506.09289).

Aider Polyglot

What it measures

Whether an agent (Aider) can solve Exercism problems across languages and apply edits that pass tests after at most one feedback round.

The specifics

  • Focuses on Exercism “kata-style” exercises rather than algorithmic puzzles.
  • Polyglot coverage includes C++, Go, Java, JavaScript, Python, and Rust (225 problems total).
  • Typical solutions run from roughly 30–200 LOC, often within one or two files.
  • Evaluation is unit-test driven via the Aider harness (a simplified pass/feedback loop is sketched after this list). Details: Aider blog post.
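
The "at most one feedback round" protocol is easy to picture. Below is a rough sketch in that spirit, not Aider's actual harness code: it lets the agent edit the exercise, runs the tests, and on failure feeds the failing-test output back to the agent exactly once. The run_agent callable and the use of pytest as the test command are assumptions for illustration.

```python
# Rough sketch of a unit-test-driven loop in the spirit of Aider Polyglot:
# one initial attempt, then at most one retry with failing-test output.
# `run_agent` is a hypothetical callable, not part of Aider's real harness.
import subprocess
from pathlib import Path
from typing import Callable, Optional


def run_tests(exercise_dir: Path) -> subprocess.CompletedProcess:
    """Run the exercise's test suite (pytest here; other languages differ)."""
    return subprocess.run(
        ["pytest", "-q"], cwd=exercise_dir, capture_output=True, text=True
    )


def solve(
    exercise_dir: Path,
    instructions: str,
    run_agent: Callable[[Path, str, Optional[str]], None],
) -> bool:
    run_agent(exercise_dir, instructions, None)   # agent edits files in place
    result = run_tests(exercise_dir)
    if result.returncode == 0:
        return True
    feedback = result.stdout + result.stderr      # single feedback round
    run_agent(exercise_dir, instructions, feedback)
    return run_tests(exercise_dir).returncode == 0
```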

Verdict

Aider Polyglot assesses cross-language competence on contained, well-specified tasks. It’s not representative of full SWE responsibilities.

LiveCodeBench

What it measures

Python competitive-programming skills under hidden test suites (LeetCode-style).

The specifics

  • Tasks include generating solutions, fixing incorrect solutions, and some unusual variants like predicting function outputs.
  • Balanced mix of easy/medium/hard problems; hidden tests are used for evaluation (a toy judging loop is sketched after this list).
  • Efforts were made to avoid contamination by using problems released after model cutoffs. Project: http://livecodebench.github.io/
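
A hidden-test check of this kind usually reduces to running the candidate program against input/output pairs the model never sees. The sketch below is illustrative only: the candidate.py filename, the (input, expected_output) test format, and the fixed timeout are assumptions, not LiveCodeBench's actual judging code.

```python
# Illustrative stdin/stdout judge for LeetCode/Codeforces-style problems.
# Test-case format and timeout are assumptions made for this sketch.
import subprocess


def judge(solution_path: str, hidden_tests: list[tuple[str, str]]) -> bool:
    """Return True only if the program matches every hidden test's output."""
    for stdin_text, expected in hidden_tests:
        try:
            proc = subprocess.run(
                ["python", solution_path],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=10,  # crude stand-in for a per-test time limit
            )
        except subprocess.TimeoutExpired:
            return False
        if proc.returncode != 0 or proc.stdout.strip() != expected.strip():
            return False
    return True


# Example with a toy "print the sum of two integers" problem:
# judge("candidate.py", [("1 2\n", "3"), ("10 -4\n", "6")])
```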

Verdict

LiveCodeBench is a reliable proxy for LeetCode-like performance in Python, not for general SWE work.

Other benchmarks

Why benchmarking is hard — and why that’s encouraging for coding agents

Designing high-quality benchmarks requires substantial human effort. Automated verification scales easily, so many suites default to unit-test pass rates. That approach is pragmatic but limited: unit tests often miss deeper correctness issues and do not capture requirements elicitation, architecture decisions, security, maintainability, or the long-term trade-offs central to SWE.

Several promising directions for richer evaluation emerged:

  • Use generative testing (PBT, fuzzing) instead of or alongside unit tests (see the sketch after this list).
  • Apply formal methods when feasible.
  • Validate against automated UATs and end-to-end tests.
  • Start benchmarks from product-level inputs (PRDs, business context).
  • Create setups that require agents to acquire missing information or clarify requirements.
  • Use well-calibrated human judges for fuzzier quality criteria.
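
As a concrete illustration of the first point, property-based testing checks an implementation against many generated inputs rather than a handful of hand-picked cases. Here is a minimal sketch using the Hypothesis library; the dedupe_preserve_order function is a made-up example target, not something from the source article.

```python
# Minimal property-based test with Hypothesis: generated inputs probe
# properties of the implementation instead of a few fixed unit-test cases.
# `dedupe_preserve_order` is a made-up target used only for illustration.
from hypothesis import given, strategies as st


def dedupe_preserve_order(items: list[int]) -> list[int]:
    """Remove duplicates while keeping the first-occurrence order."""
    seen: set[int] = set()
    out: list[int] = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


@given(st.lists(st.integers()))
def test_dedupe_properties(xs: list[int]) -> None:
    result = dedupe_preserve_order(xs)
    assert len(result) == len(set(result))          # no duplicates remain
    assert set(result) == set(xs)                   # nothing lost or invented
    assert result == sorted(set(xs), key=xs.index)  # first-occurrence order kept
```

A benchmark built on properties like these is harder to satisfy with a patch that merely overfits to a few known test cases, which is one reason the source article favors generative testing over fixed unit tests alone.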

Because current benchmarks leave substantial low-hanging fruit, progress on coding agents still has considerable runway. Improved benchmarks and RL environments that capture the messiness of real engineering work could reveal considerably higher practical capability than unit-test pass rates alone suggest.

Original source: https://blog.nilenso.com/blog/2025/09/25/swe-benchmarks/
