HN Today

Can LLMs model real-world systems in TLA+?

A new benchmark, SysMoBench, rigorously evaluates the ability of large language models (LLMs) to model real-world systems in TLA+, a formal specification language. Researchers found that LLMs excel at generating syntactically correct TLA+ but critically fail to capture system-specific behaviors, often 'reciting' general protocol knowledge instead of modeling the details of a particular implementation. This analysis highlights a significant gap in current LLM capabilities for formal methods and has sparked considerable discussion on the role of AI in system verification.

Score: 74
Comments: 15
Highest Rank: #6
Time on Front Page: 12h
First Seen: May 9, 12:00 AM
Last Seen: May 9, 11:00 AM

The Lowdown

Researchers have developed SysMoBench, a benchmark designed to assess how well Large Language Models can create formal specifications of real-world computing systems using TLA+. Their initial experiences showed LLMs producing plausible-looking specifications that, on closer inspection, merely regurgitated textbook examples of protocols like Raft rather than accurately modeling specific implementations like Etcd's.

  • SysMoBench evaluates LLM-generated TLA+ specifications across eleven diverse systems, ranging from concurrent synchronization primitives to complex distributed protocols.
  • The evaluation proceeds through four phases: syntax checking, runtime execution with TLC, conformance validation (comparing spec behavior against real system traces), and invariant checking for safety and liveness properties.
  • Key findings reveal that LLMs achieve near-perfect scores on syntax but dramatically falter in conformance (averaging 46%) and invariant checks (averaging 41%).
  • Two common failure modes emerged: LLM-generated specs either entered states that real systems could never reach or failed to reach states that real systems routinely achieve, often due to misrepresenting atomic actions or data structures.
  • This discrepancy stems from LLMs relying on 'textbook formalization templates' rather than understanding the granular, implementation-specific logic of how a system operates.
  • The 'Transition Validation' method, central to SysMoBench's conformance phase, provides action-granular diagnostics, pinpointing precisely where a spec's behavior deviates from the actual system's trace.
  • LLM performance correlated with system complexity, with models doing relatively well on simple systems but struggling significantly on distributed systems like Etcd or RedisRaft.
  • The team acknowledges open challenges, including improving trace sampling, managing state abstraction, and generalizing the benchmark setup. They also note the development of 'Specula,' an agent designed to perform TLA+ formal modeling autonomously, which reportedly achieves full scores on current SysMoBench tasks.
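The 'Transition Validation' idea behind the conformance phase can be illustrated with a minimal Python sketch. This is a hypothetical simplification, not SysMoBench's actual code: a toy spec of a single-bit lock is encoded as the set of transitions allowed from each state, and a real-system trace is replayed step by step, reporting the index of the first action where the observed behavior diverges from what the spec permits.

```python
# Hypothetical sketch of action-granular "transition validation":
# replay a real-system trace against the transitions a spec allows,
# reporting the first step where spec and system diverge.

def spec_next_states(state):
    """Toy spec of a single-bit lock: the set of (action, next_state)
    pairs the specification permits from `state`."""
    if state == "unlocked":
        return {("Acquire", "locked")}
    else:  # locked
        return {("Release", "unlocked")}

def validate_trace(init_state, trace):
    """Walk the trace; return (conforms, index of first divergent step)."""
    state = init_state
    for i, step in enumerate(trace):
        if step not in spec_next_states(state):
            return False, i  # spec forbids this observed transition
        state = step[1]
    return True, None

# A trace the spec accepts...
ok, _ = validate_trace("unlocked",
                       [("Acquire", "locked"), ("Release", "unlocked")])
# ...and one it rejects at step 1 (a double acquire the spec forbids).
bad, idx = validate_trace("unlocked",
                          [("Acquire", "locked"), ("Acquire", "locked")])
```

Because the check fails at a specific step index, a divergence points at the exact action whose atomicity or data representation the generated spec got wrong, which is the kind of diagnostic the benchmark's conformance phase provides.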

In essence, while LLMs can generate grammatically correct formal specifications, they currently fall short at accurately abstracting and representing the intricate, implementation-specific behaviors of real-world systems, underscoring the need for understanding that goes beyond surface-level pattern matching.

The Gossip

Alternative Approaches to Assurance

Commenters discussed alternative formal methods and tools, questioning whether TLA+ is the optimal choice for all systems. Some championed Lean 4 for its ability to model systems and generate executable code with proofs, while others pointed to approaches like Verus that tightly couple implementation and verification to prevent divergence between model and code. The debate also touched on the viability of 'Lean for everything' versus the continued relevance of domain-specific formalisms for hardware or specific software types.

Human Oversight in LLM-Driven Formalization

A prevalent theme was the critical role of human oversight when using LLMs for formal methods. Concerns were raised about LLMs' struggles with generating correct safety and liveness properties and with managing state-space explosion without human guidance. Commenters questioned what assurance is actually gained if LLMs produce both the specification and the program without clear human intent, particularly when it comes to verifying that an implementation truly matches an LLM-generated specification. Examples included an LLM-generated Monopoly spec that omitted key game mechanics or asserted unprovable liveness properties, illustrating the potential for plausible but flawed outputs.

LLM Progress and Practical Experiences

Despite the benchmark's findings, some users shared positive, anecdotal experiences with LLMs, particularly Claude, in generating TLA+ specifications. One commenter noted impressive progress, describing a complex Monopoly ruleset modeled by Claude that appeared 'passable' (though unverified). While acknowledging ongoing imperfections, these observations reflect a perceived rapid improvement in LLMs' capabilities for formal methods, suggesting that while they may not be perfect, their utility as an assistance tool is growing.