
ProgramBench: Can Language Models Rebuild Programs from Scratch?

A new benchmark, ProgramBench, evaluates whether language models can truly build software from scratch, revealing significant architectural and implementation shortcomings. Current LMs struggle with holistic software engineering, favoring monolithic designs and failing to fully resolve even a single task. The paper offers a much-needed reality check on ambitious claims about AI's autonomous coding prowess, drawing interest for what it reveals about the practical limitations of current models.

Score: 5
Comments: 0
Highest Rank: #7
Time on Front Page: 6h
First Seen: May 7, 5:00 AM
Last Seen: May 7, 10:00 AM

The Lowdown

A recent paper introduces ProgramBench, a novel benchmark designed to rigorously assess language models' capabilities in holistic software engineering. Moving beyond narrow tasks like bug fixing or single-feature development, ProgramBench challenges AI agents to architect and implement entire software projects from documentation alone.

  • Problem Statement: Existing benchmarks for language models in code generation focus on isolated tasks, failing to evaluate their ability to design and build complete software systems.
  • ProgramBench's Approach: Given a program's documentation, agents must recreate its functionality. Evaluation uses agent-driven fuzzing to generate end-to-end behavioral tests, eliminating bias from prescribed implementation structures.
  • Scope of Tasks: The benchmark includes 200 diverse tasks, ranging from simple command-line tools to complex, widely used software like FFmpeg, SQLite, and the PHP interpreter.
  • Key Findings: None of the 9 evaluated LMs fully resolved a single task. The best-performing model passed 95% of tests on only 3% of tasks, exposing a significant capability gap.
  • Architectural Deficiencies: Models consistently produced monolithic, single-file implementations, which starkly contrasts with the modular, multi-file architectures typical of human-written code.
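The evaluation approach described above, fuzz-generated end-to-end tests that compare observable behavior rather than inspecting implementation structure, can be illustrated with a minimal differential-testing sketch. Everything here is illustrative: the function names, the random input generator, and the use of stdout plus exit code as the behavioral signature are assumptions, not the paper's actual agent-driven harness.

```python
# Hedged sketch of differential behavioral testing in the spirit of
# ProgramBench's fuzz-based evaluation: run a reference program and a
# candidate reimplementation on the same generated inputs and compare
# their observable behavior. All names are hypothetical.
import random
import string
import subprocess

def random_input(seed: int, max_len: int = 64) -> str:
    """Deterministic pseudo-random text input (a stand-in for an agent's fuzzer)."""
    rng = random.Random(seed)
    n = rng.randint(0, max_len)
    return "".join(rng.choice(string.printable) for _ in range(n))

def behavior(cmd: list[str], stdin_text: str) -> tuple[int, str]:
    """Observable behavior: (exit code, stdout). A timeout is its own outcome."""
    try:
        proc = subprocess.run(cmd, input=stdin_text, capture_output=True,
                              text=True, timeout=5)
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        return -1, "<timeout>"

def differential_test(reference: list[str], candidate: list[str],
                      trials: int = 100) -> float:
    """Fraction of fuzzed inputs on which the candidate matches the reference."""
    passed = 0
    for seed in range(trials):
        text = random_input(seed)
        if behavior(reference, text) == behavior(candidate, text):
            passed += 1
    return passed / trials
```

Because the comparison is purely input/output, the candidate is free to choose any internal architecture, which is what lets the benchmark score implementations without prescribing a file or module structure.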

In essence, ProgramBench underscores that while large language models can perform well on specific coding challenges, their ability to handle the architectural and integrative complexities of full-scale software development remains severely limited, often resulting in impractical and unmaintainable designs.