
ProgramBench: Can Language Models Rebuild Programs from Scratch?

A new benchmark, ProgramBench, evaluates whether language models can truly build software from scratch, revealing significant architectural and implementation shortcomings. Current LMs struggle with holistic software engineering, favoring monolithic designs and failing to fully resolve even a single task. The paper offers a much-needed reality check on ambitious claims about AI's autonomous coding prowess, drawing interest for what it reveals about the practical limitations of current models.

Score: 5
Comments: 0
Highest Rank: #7
Time on Front Page: 6h
First Seen: May 7, 5:00 AM
Last Seen: May 7, 10:00 AM

The Lowdown

A recent paper introduces ProgramBench, a novel benchmark designed to rigorously assess language models' capabilities in holistic software engineering. Moving beyond narrow tasks like bug fixing or single-feature development, ProgramBench challenges AI agents to architect and implement entire software projects from documentation alone.

  • Problem Statement: Existing benchmarks for language models in code generation focus on isolated tasks, failing to evaluate their ability to design and build complete software systems.
  • ProgramBench's Approach: Given a program's documentation, agents must recreate its functionality. Evaluation uses agent-driven fuzzing to generate end-to-end behavioral tests, eliminating bias from prescribed implementation structures.
  • Scope of Tasks: The benchmark includes 200 diverse tasks, ranging from simple command-line tools to complex, widely used software like FFmpeg, SQLite, and the PHP interpreter.
  • Key Findings: None of the 9 evaluated LMs fully resolved a single task. The best-performing model passed 95% of tests on only 3% of tasks, exposing a significant capability gap.
  • Architectural Deficiencies: Models consistently produced monolithic, single-file implementations, which starkly contrasts with the modular, multi-file architectures typical of human-written code.
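The evaluation approach described above, fuzz-generated end-to-end tests that compare observable behavior rather than inspecting implementation structure, can be illustrated with a minimal differential-testing sketch. Everything here is illustrative: the function names, the random input generator, and the use of stdout plus exit code as the behavioral signature are assumptions, not the paper's actual agent-driven harness.

```python
# Hedged sketch of differential behavioral testing in the spirit of
# ProgramBench's fuzz-based evaluation: run a reference program and a
# candidate reimplementation on the same generated inputs and compare
# their observable behavior. All names are hypothetical.
import random
import string
import subprocess

def random_input(seed: int, max_len: int = 64) -> str:
    """Deterministic pseudo-random text input (a stand-in for an agent's fuzzer)."""
    rng = random.Random(seed)
    n = rng.randint(0, max_len)
    return "".join(rng.choice(string.printable) for _ in range(n))

def behavior(cmd: list[str], stdin_text: str) -> tuple[int, str]:
    """Observable behavior: (exit code, stdout). A timeout is its own outcome."""
    try:
        proc = subprocess.run(cmd, input=stdin_text, capture_output=True,
                              text=True, timeout=5)
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        return -1, "<timeout>"

def differential_test(reference: list[str], candidate: list[str],
                      trials: int = 100) -> float:
    """Fraction of fuzzed inputs on which the candidate matches the reference."""
    passed = 0
    for seed in range(trials):
        text = random_input(seed)
        if behavior(reference, text) == behavior(candidate, text):
            passed += 1
    return passed / trials
```

Because the comparison is purely input/output, the candidate is free to choose any internal architecture, which is what lets the benchmark score implementations without prescribing a file or module structure.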

In essence, ProgramBench underscores that while large language models can perform well on specific coding challenges, their ability to handle the architectural and integrative complexities of full-scale software development remains severely limited, often resulting in impractical and unmaintainable designs.