HN
Today

Many SWE-bench-Passing PRs would not be merged

A new study reveals that roughly half of AI-generated code patches, even those passing automated benchmarks like SWE-bench, would be rejected by human maintainers due to problems such as poor code quality or broken functionality. This challenges a naive interpretation of AI coding benchmarks, suggesting they significantly overstate real-world usefulness by failing to account for the nuanced demands of human review. The findings sparked discussion about the adequacy of current AI evaluation metrics and the persistent gap between automated testing and practical software development standards.

103
Score
17
Comments
#2
Highest Rank
13h
on Front Page
First Seen
Mar 11, 10:00 PM
Last Seen
Mar 12, 10:00 AM
Rank Over Time
3 → 3 → 2 → 2 → 3 → 3 → 5 → 6 → 6 → 8 → 8 → 11 → 15

The Lowdown

A study conducted by METR investigates the crucial gap between AI's performance on automated coding benchmarks and its actual utility in real-world software development workflows. Specifically focusing on the SWE-bench Verified benchmark, the research sought to determine how many AI-generated Pull Requests (PRs) that pass automated tests would actually be merged by human maintainers.

  • Researchers recruited four active maintainers from three SWE-bench Verified repositories (scikit-learn, Sphinx, pytest) to review 296 AI-generated PRs, along with 47 human-written "golden patches" as a baseline.
  • The central finding indicates that AI-generated PRs, despite passing the automated SWE-bench grader, were accepted by human maintainers at a rate approximately 24 percentage points lower than their automated pass rate.
  • Maintainers frequently rejected AI PRs due to poor code quality (style, standards), breaking other code, or fundamental functionality failures, even when automated tests passed.
  • While AI models are improving, the study suggests that human-approved merge rates are rising more slowly than automated benchmark scores imply, though this trend finding is less robust.
  • The study acknowledges several limitations, including using a subset of a single benchmark, a not-fully-realistic review process (e.g., no CI), and a static comparison rather than an iterative human-like development process.
  • Ultimately, the research cautions against over-reliance on automated benchmarks as a sole indicator of AI's real-world coding capabilities, urging a more nuanced view for forecasting AI progress.

The study underscores that real-world software engineering involves more than just passing tests; it requires adherence to coding standards, architectural consistency, and human-centric maintainability, aspects that current automated benchmarks largely overlook.

The Gossip

Benchmark Blinders: The Chasm Between Scores and Usability

Commenters largely concur with the article's premise: automated benchmarks like SWE-bench, while valuable for specific tasks, fail to evaluate critical aspects of real-world software development. They highlight that human maintainers consider factors beyond passing tests, such as code style, architectural fit, maintainability, intent alignment, and team-specific preferences, which current automated evaluations cannot capture. This leads to a significant discrepancy between benchmark success and actual mergeability. Some suggest that SWE-bench might be an inadequate test if it doesn't align with maintainer expectations.

Slop or Scrutiny: Dissecting AI Code Review

The discussion grapples with the interplay between potentially low-quality AI-generated code and human review biases. Some argue that maintainers may harbor prejudice against AI code and review it more harshly, noting that even in a blinded review, AI 'slop' can be obvious and effectively unblind the reviewer. Others contend that AI-generated 'slop' is often inherently recognizable due to its boilerplate nature or lack of deep understanding, making genuine quality deficiencies, not reviewer bias, the dominant factor. The consensus is that AI models tend to produce code that 'appears to work' rather than being robust or well-integrated, leading to necessary rejections.