Will It Mythos?

This post rigorously benchmarks various LLMs, questioning if publicly available models can truly match the heralded security bug-finding prowess of Anthropic's Mythos. The results reveal surprising top performers and underachievers among LLM families, including highly competitive, cost-effective Chinese models. Hacker News is keenly interested in practical, verifiable LLM capabilities, making this direct comparison a valuable contribution to understanding the current AI landscape.

Time on Front Page: 14h

First Seen: Jun 23, 4:00 AM

Last Seen: Jun 23, 5:00 PM

The Lowdown

The author, "mindingnever," embarked on an ambitious benchmarking project titled "Will It Mythos?" to quantitatively assess whether currently available large language models (LLMs) can replicate the security vulnerability detection capabilities attributed to Anthropic's highly exclusive Mythos model. Skeptical of the public explanations for Mythos's restricted access, the author aimed to provide data on whether its superior performance in bug finding is genuine or merely marketing hype.

- The benchmark uses a corpus of nine security bugs originally identified by Mythos, chosen so that they postdate the models' knowledge cutoffs and can be confirmed by top-tier models such as Opus when explicitly pointed at them.
- Models were tested "blind": each was given a project file, basic tools, and full repository access, but no hints about the vulnerability's nature or location, simulating a realistic security audit.
- Key caveats include sparse data (a single run per model and bug), the significant cost and time involved, and the potential for models with network access to "cheat" by looking up CVEs, though no such behavior was observed.
- The agent findings were surprising: for most models, running through an agent did not improve performance and often increased cost and time. Only Claude models showed a cost benefit, in their native agent.
- The results revealed a diverse performance landscape:
  - Qwen 3.6 27B was a standout, performing exceptionally well for its size and cost, even outperforming some larger commercial models.
  - Gemini 3.5 Flash surpassed 3.1 Pro but remained expensive.
  - "Cheap Chinese models" like MiMo and DeepSeek were lauded for their strong performance and significantly lower cost, rivaling Opus 4.8 and GPT 5.5.
  - Mistral Medium and Laguna M.1 largely failed to find known vulnerabilities, indicating limitations in their security auditing capabilities.
  - Haiku and Sonnet were deemed poor value, consuming many tokens without superior results.
  - Gemma 4 MoE unexpectedly led in detection but struggled with stability, often getting stuck in loops.
- Subsequent updates added more models, with varied results, including a curious inverse relationship between model size and performance for some Nemotron variants.
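
To make the "single run per model and bug" caveat concrete, here is a minimal Python sketch of how such results might be tallied into a detection count and a cost per bug found. It is not the author's harness: the `Run` record, the `summarize` helper, the model and bug names, and every number below are hypothetical placeholders, not data from the post.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One blind attempt by one model on one bug (hypothetical record)."""
    model: str
    bug_id: str
    found: bool
    cost_usd: float


def summarize(runs):
    """Aggregate single runs into per-model detection counts and cost per find."""
    summary = {}
    for run in runs:
        entry = summary.setdefault(
            run.model, {"found": 0, "total": 0, "cost_usd": 0.0}
        )
        entry["total"] += 1
        entry["found"] += int(run.found)
        entry["cost_usd"] += run.cost_usd
    for entry in summary.values():
        # Cost per find is undefined when a model found nothing.
        entry["cost_per_find"] = (
            entry["cost_usd"] / entry["found"] if entry["found"] else None
        )
    return summary


# Placeholder data for illustration only; not results from the benchmark.
runs = [
    Run("model-a", "bug-1", True, 0.50),
    Run("model-a", "bug-2", False, 0.25),
    Run("model-b", "bug-1", False, 2.00),
    Run("model-b", "bug-2", False, 1.00),
]
summary = summarize(runs)
for model, entry in summary.items():
    print(model, entry)
```

With only one run per cell, a single lucky or unlucky attempt moves a model's count by a full bug, which is why the author flags the data as sparse.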

In conclusion, while the benchmark found that no public model currently matches Mythos's reported ability to find four specific, difficult bugs, the results suggest that the gap might not be insurmountable. The author speculates that with improved prompting, tooling, or harnesses, current models could potentially achieve similar results, especially given Opus's proven ability to understand these bugs when guided. The project underscores the rapid evolution and diverse capabilities of LLMs in specialized tasks.

The Gossip

Mythos Mystique: Superiority or Savant-Like Social Engineering?

Commenters vigorously debate the true nature of the reported superiority of Mythos (which users often call Fable) in infosec tasks. Many users, citing personal experience, claim Fable is "fundamentally much better," acts like a "colleague," and possesses a unique "persistence" or "savant" quality that other models, including Opus, lack. Others question whether this perceived superiority is an intentional "Anthropic™ character" or "cheap social engineering" radiating unearned confidence, implying a psychological effect on users rather than purely technical prowess. There is also discussion of whether Mythos might be a fine-tuned version of a base model like Opus with specific steering.

Benchmark Nuances: Clarifying the 'Blind Test'

A point of confusion arose around the benchmark's methodology. Initial statements in the article mentioned that bugs "can be identified by several models if they are pointed directly at it," leading to questions about whether this contradicted the "blind" testing approach. Commenters clarified that the "pointing directly" was a pre-test verification step to ensure the bugs were discernible, while the actual benchmark involved models searching "blind" without specific hints, given only the file and repository context. This discussion helped elucidate the rigor of the testing process.
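
The distinction commenters drew can be sketched in a few lines of Python. This is an illustration only: the `build_prompt` helper, the prompt wording, the file path, and the hint text are all assumptions made up for this example, and the article's actual prompts are not reproduced here.

```python
from typing import Optional


def build_prompt(target_file: str, hint: Optional[str] = None) -> str:
    """Build an audit prompt; a hint is included only for pre-test verification."""
    lines = [
        f"Audit {target_file} for security vulnerabilities.",
        "You may read any other file in the repository.",
    ]
    if hint is not None:
        # Clued mode: used beforehand to confirm the bug is discernible at all.
        lines.append(f"Hint: {hint}")
    return "\n".join(lines)


# Hypothetical file name and hint, for illustration only.
verification_prompt = build_prompt(
    "src/parser.c", hint="check the length handling in the header parser"
)
blind_prompt = build_prompt("src/parser.c")

print(verification_prompt)
print("---")
print(blind_prompt)
```

Only the blind variant corresponds to the benchmark proper; the clued variant stands in for the earlier check that each bug can be recognized when a model is pointed directly at it.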

AI's Security Role: Finding Flaws vs. Building Secure Systems

The discussion extended to the broader implications of AI in cybersecurity: are LLMs better at identifying vulnerabilities in existing code or at generating secure new code? Some argue that models like Opus are "terrifying at infosec" and "very good at finding flaws," potentially surpassing human developers in this aspect, especially for legacy code in unreliable languages like C. However, there's skepticism about their ability to "make a system that doesn't have (security) flaws," emphasizing that building truly secure systems often requires formal verification and specific language choices, which LLMs might struggle with in practical application.