
GPT-5.5 hallucinates 3x more than MIT-licensed GLM-5.2

A new analysis reports that, despite their massive size, frontier models like GPT-5.5 hallucinate at significantly higher rates than smaller, open-source alternatives. This challenges the long-held 'bigger is better' paradigm and suggests that intelligence may plateau or even degrade with endless scaling. The findings sparked discussion on Hacker News about the real-world utility and maintainability of LLM-generated content, especially code, and about the limitations of current evaluation metrics.

Score: 141
Comments: 38
Highest Rank: #7
Time on Front Page: 11h
First Seen: Jun 20, 7:00 AM
Last Seen: Jun 20, 5:00 PM
[Chart: Rank Over Time]

The Lowdown

The article critically examines the prevailing notion that larger language models inherently lead to better performance, specifically highlighting a startling disparity in hallucination rates. It posits that while bigger models might score higher on raw capability benchmarks, they often struggle with truthfulness and recognizing when they 'don't know' an answer.

  • AI labs are reportedly growing skeptical of the continuous scaling of parameter counts and training data, with real-world incidents like the Claude Fable 5 ban underscoring potential risks.
  • GLM-5.2, an open-source model with 753 billion parameters, achieves intelligence index scores remarkably close to proprietary, much larger models like GPT-5.5 (estimated 1-2 trillion parameters).
  • However, on the AA-Omniscience hallucination benchmark, GPT-5.5 scored an 86% hallucination rate, and DeepSeek V4 Pro an even higher 94%, meaning they confidently fabricate answers rather than admit uncertainty.
  • In contrast, GLM-5.2 showed a significantly lower hallucination rate of 28%, demonstrating a better ability to identify logical fallacies and technical impossibilities.
  • The author argues that immense model size can lead to a failure to learn how to express uncertainty, creating models that actively convince users of incorrect solutions.
  • This problem introduces a 'trilemma' for modern AI development: balancing raw capability, uncertainty calibration (hallucination rate), and computational efficiency.

The industry must shift its focus from blind scaling to a more nuanced approach that prioritizes truthfulness and efficiency alongside raw intelligence, acknowledging that simply making models bigger may no longer yield proportional or even beneficial returns.

The Gossip

Metric Muddle and Measurement Musings

Commenters debated the definition and interpretation of hallucination rates. Some argued the presented metrics are tricky to interpret as they're conditional on a model not knowing the answer, suggesting an "absolute hallucination rate" would be more useful. Others countered that any made-up answer, regardless of whether the model "knew" the real answer, should count as a hallucination. There was also discussion on how prompt engineering might influence these rates, with some suggesting that detailed prompting could significantly alter a model's performance on these benchmarks.
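The gap between the two readings can be made concrete with a small sketch. It assumes the conditional definition the commenters describe (wrong answers divided by all questions the model did not answer correctly) and contrasts it with an "absolute" rate over all questions. The class, the function names, and the answer counts are hypothetical: the counts are chosen only so that the conditional rates match the 86% and 28% quoted above, and the implied accuracy figures are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkCounts:
    """Hypothetical tally of a model's answers on a question set."""
    correct: int
    incorrect: int   # confident but wrong answers
    abstained: int   # "I don't know" style responses

    @property
    def total(self) -> int:
        return self.correct + self.incorrect + self.abstained


def conditional_hallucination_rate(c: BenchmarkCounts) -> float:
    """Wrong answers as a share of questions the model did not get right."""
    not_correct = c.incorrect + c.abstained
    return c.incorrect / not_correct if not_correct else 0.0


def absolute_hallucination_rate(c: BenchmarkCounts) -> float:
    """Wrong answers as a share of all questions asked."""
    return c.incorrect / c.total if c.total else 0.0


# Illustrative counts only (assumption): not taken from any benchmark run.
model_a = BenchmarkCounts(correct=800, incorrect=172, abstained=28)
model_b = BenchmarkCounts(correct=400, incorrect=168, abstained=432)

for name, counts in [("model_a", model_a), ("model_b", model_b)]:
    print(
        name,
        f"conditional={conditional_hallucination_rate(counts):.1%}",
        f"absolute={absolute_hallucination_rate(counts):.1%}",
    )
```

With these made-up counts, the conditional rates differ by roughly a factor of three while the absolute rates are nearly identical, because the first hypothetical model answers far more questions correctly to begin with. Whether anything like this holds for the real models depends on accuracy figures not given here.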

Scaling Skepticism and Subtlety

Commenters discussed the core premise that "bigger is not better." Some reinforced the significance of the finding, particularly for scaling limits and their implications for AI valuations. Others offered anecdotal counter-examples, noting that even GLM-5.2 can subtly hallucinate or misinterpret user intent in real-world use. There was also speculation that hallucination stems more from biases in training data and the absence of any mechanism to "fear" being wrong (like a human amygdala) than from model size alone. The author, for their part, acknowledged a correlation between larger models, larger training data, and potential overfitting.

Code Quality Conundrums

A significant part of the discussion revolved around the impact of LLMs on code quality and maintainability. Some expressed strong apprehension that LLM-generated code, while appearing functional, might contain subtle flaws that compound over time, leading to difficult-to-maintain "hot garbage" and unreadable codebases. Conversely, others argued that with proper human oversight, reviews, and guardrails like linters and architectural tests, LLM-assisted coding can produce maintainable and even improved codebases, noting that human-written code often suffers from similar issues or worse. They emphasized that LLMs are powerful assistants, but their output still requires human expertise and validation.

Human vs. Halting AI

Commenters pondered the fairness of expecting perfect non-hallucination from LLMs, asking whether human hallucination rates are ever measured. The author acknowledged that humans also exhibit a Dunning-Kruger effect, suggesting a potential parallel in confidently incorrect responses and a lack of self-awareness regarding one's own ignorance. This led to reflections on whether AI should be held to a higher standard than human cognition.