You Hired the AI to Write the Tests. Of Course They Pass
As AI code generation becomes ubiquitous, a critical problem emerges: how can you trust code when the AI also writes its tests? The result is a 'self-congratulation machine.' This article proposes a solution by adapting Test-Driven Development principles, advocating for explicitly defined acceptance criteria before code generation. It details a practical workflow and a tool (`claude-verify`) that automates validation against these criteria, shifting developer focus from reviewing diffs to verifying failures.
The Lowdown
The author identifies a significant challenge in modern AI-driven development: the difficulty of trusting code generated by autonomous agents when those same agents also create the tests. This scenario often results in tests that validate the AI's understanding rather than the developer's true intent, leading to a system that 'self-congratulates' on passing its own, potentially flawed, checks. The core issue is that reviewing AI-generated code and tests becomes unsustainable and ineffective.
Here's a breakdown of the proposed solution and its implementation:
- The Problem with AI-Generated Tests: When an AI writes both the code and the tests, it validates its own interpretation of requirements, not necessarily the user's actual desired outcome. This creates a blind spot for original misunderstandings, similar to relying on a single author for both code and review.
- Revisiting TDD: The article suggests that Test-Driven Development (TDD) principles offer a path forward. By defining what "correct" looks like (via acceptance criteria) before the AI generates the code, the developer forces a clear specification that the AI must then satisfy and the verification system can check.
- Practical Implementation: For frontend features, acceptance criteria (ACs) are created for specific behaviors (e.g., successful login, error messages, empty field validation, rate limiting). For backend, observable API behaviors (status codes, headers, error messages) are specified.
- Automated Verification: Once ACs are defined, an agent builds the feature. A separate verification system then runs automated checks (e.g., Playwright browser agents for frontend, curl commands for backend) against each AC, producing a report with verdicts (pass/fail/needs-human-review).
- Workflow Shift: This approach changes the developer's role from reviewing potentially massive code diffs to reviewing only the failures reported by the verification system, making the process more efficient and targeted.
- The `claude-verify` Tool: The author built a Claude Skill that orchestrates this process using `claude -p` (headless mode) and Playwright. It comprises four stages: Pre-flight (bash checks), Planner (an Opus call to strategize checks), Browser Agents (parallel Sonnet calls, one per AC, for execution), and Judge (a final Opus call to interpret evidence and issue verdicts).
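To make the AC-driven checks above concrete, here is a minimal sketch of what a machine-checkable acceptance criterion and a backend verification step might look like. All names and field shapes (`AcceptanceCriterion`, `check_api_ac`, the verdict dictionary) are illustrative assumptions, not taken from the article or the tool itself.

```python
# A hedged sketch: acceptance criteria defined up front as data, then checked
# against an API's observable behavior (status code, error message body).
from dataclasses import dataclass


@dataclass
class AcceptanceCriterion:
    ac_id: str
    description: str
    expected_status: int                 # observable behavior: HTTP status code
    expected_error_substring: str = ""   # expected text in the error body, if any


def check_api_ac(ac, actual_status, actual_body):
    """Compare one observed API response against one acceptance criterion."""
    if actual_status != ac.expected_status:
        return {"ac": ac.ac_id, "verdict": "fail",
                "reason": f"expected status {ac.expected_status}, got {actual_status}"}
    if ac.expected_error_substring and ac.expected_error_substring not in actual_body:
        return {"ac": ac.ac_id, "verdict": "needs-human-review",
                "reason": "status matched but the error message differs"}
    return {"ac": ac.ac_id, "verdict": "pass", "reason": "observable behavior matched"}


# Example: a rate-limiting AC checked against a simulated response.
ac = AcceptanceCriterion(
    ac_id="AC-4",
    description="Login is rate limited after repeated failed attempts",
    expected_status=429,
    expected_error_substring="too many attempts",
)
report = check_api_ac(ac, actual_status=429,
                      actual_body='{"error": "too many attempts"}')
print(report["verdict"])  # → pass
```

The point of the data-first shape is that the criteria exist before any code is generated, so the same structures can drive both the AI's implementation and an independent verification pass.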
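The four-stage pipeline can be sketched structurally as follows. The real tool shells out to `claude -p` and Playwright; here every model call is stubbed with canned data so only the control flow (sequential pre-flight and planning, parallel fan-out of one agent per AC, final judging) is shown. All function names and return shapes are assumptions for illustration, not the tool's actual API.

```python
# Structural sketch of a four-stage verify pipeline, with model calls stubbed.
from concurrent.futures import ThreadPoolExecutor


def preflight(feature_dir):
    # Stage 1: cheap environment checks (build passes, server reachable, etc.)
    return True  # stub: pretend the environment is healthy


def plan_checks(acceptance_criteria):
    # Stage 2: a single "planner" call turns ACs into concrete check plans.
    return [{"ac": ac, "steps": f"open the app and exercise: {ac}"}
            for ac in acceptance_criteria]


def run_browser_agent(check):
    # Stage 3: one agent per AC, run in parallel; each returns raw evidence.
    return {"ac": check["ac"], "evidence": f"executed: {check['steps']}"}


def judge(evidence):
    # Stage 4: a final call reads all evidence and issues per-AC verdicts.
    return [{"ac": e["ac"], "verdict": "pass"} for e in evidence]


def verify(feature_dir, acceptance_criteria):
    if not preflight(feature_dir):
        return [{"ac": ac, "verdict": "needs-human-review"}
                for ac in acceptance_criteria]
    checks = plan_checks(acceptance_criteria)
    with ThreadPoolExecutor() as pool:  # fan out: one browser agent per AC
        evidence = list(pool.map(run_browser_agent, checks))
    return judge(evidence)


verdicts = verify("./my-feature", ["login succeeds", "empty fields are rejected"])
print([v["verdict"] for v in verdicts])  # → ['pass', 'pass']
```

The parallel fan-out matters for cost and latency: the expensive, careful model runs once at each end (planning and judging), while the per-AC execution work is distributed across cheaper calls.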
In essence, the article argues that to truly trust autonomous coding agents, developers must front-load the effort of defining clear, objective acceptance criteria. This disciplined approach, though initially feeling slower, ensures that the AI's output is rigorously validated against human-defined standards, moving beyond mere self-congratulation to genuine correctness.