Human Judgment as a Specification

The rise of GenAI in programming necessitates a parallel advancement in formal methods to guarantee that AI systems produce desired solutions. However, the core challenge lies in translating informal human intent into precise, mathematical specifications. The article argues against solely relying on Large Language Models (LLMs) for this translation, as they can propagate human errors, misinterpret ambiguity, or embed common misconceptions, ultimately leading to incorrectly generated specifications.

The authors propose a 'human-in-the-loop' approach, targeting the 'responsible programmer' who values quality but is also time-constrained. Their solution must be 'meaningful' (enabling effective human judgment) and 'moderate' (requiring minimal, high-impact user actions).

PICK's Mechanism: They introduce PICK (Programmer Interactive Classification Kit), a tool that addresses this by not asking users to directly review complex formal specifications. Instead, PICK generates multiple plausible candidate specifications from an LLM prompt.
Interactive Refinement: It then presents the user with concrete scenarios or examples that distinguish between these candidates. The user simply upvotes or downvotes these examples based on their intended behavior.
Domain Agnostic: This workflow, which involves generating candidates, sampling from set differences, and updating scores based on user feedback, has been successfully applied to diverse domains like regular expressions, linear temporal logic (LTL), and attribute-based access control (ABAC).
Underlying Principles: The success across domains stems from two key properties shared by these formalisms: 'closure under negation and intersection' (allowing differences between candidates to be expressed) and 'sampling from that difference' (enabling the generation of concrete distinguishing examples). These are fundamental concepts from theoretical computer science.
Spec Elucidation: PICK goes beyond mere validation; it 'elucidates' user intent. Each user judgment on a concrete behavior serves as an independent witness, helping to clarify what the initial, often vague, prompt truly implied. This process effectively sharpens the user's intent without requiring them to write formal logic themselves.
Useful Failures: The system can 'fail' usefully if no candidate matches the user's clarified intent, preventing the deployment of incorrect specifications.
Resilience to Improvement: The authors argue PICK remains valuable even as LLMs improve, as better models will generate better candidates, but they still cannot intuit the unique, implicit intent of a human user.

In essence, PICK offers a practical and theoretically grounded method to ensure AI-generated specifications accurately reflect human desires, using interactive judgment on concrete examples to bridge the gap between informal thought and formal correctness.

Human Judgment as a Specification

The Lowdown