GPT-5.5 Codex reasoning-token clustering may be leading to degraded performance

OpenAI's GPT-5.5 Codex model appears to be hitting unusual reasoning-token ceilings, with responses clustering disproportionately at values like 516, 1034, and 1552. The linked technical deep dive presents statistical evidence that suggests degraded performance on complex tasks, specific to this model. Hacker News finds this fascinating because it offers a rare glimpse into the opaque internal workings and potential flaws of a prominent AI system, sparking discussion about model stability and hidden architectural decisions.

Score: 35
Comments: 4
Highest Rank: #2
Time on Front Page: 18h
First Seen: Jul 4, 10:00 PM
Last Seen: Jul 5, 4:00 PM

The Lowdown

An investigation has uncovered an anomaly in the behavior of OpenAI's GPT-5.5 Codex model: its reasoning_output_tokens counts frequently land on specific, fixed values. The clustering, particularly at 516, 1034, and 1552 tokens, suggests a possible underlying issue that degrades performance on complex tasks. The findings make a strong statistical case that the pattern is not random.

  • The core observation reveals that GPT-5.5 Codex responses disproportionately terminate with exactly 516 reasoning_output_tokens, with other spikes at 1034 and 1552.
  • This clustering is model-specific: GPT-5.5 accounts for 82.0% of exact-516 events despite making up only 19.3% of all responses analyzed.
  • The anomaly coincides with a decline in overall reasoning-token intensity for GPT-5.5, suggesting a potential link to reduced problem-solving depth.
  • Data from February to June 2026 shows a sharp increase in exact-516 clustering for GPT-5.5, while mean and P90 reasoning tokens decreased over the same period.
  • The author postulates that this behavior might indicate a hidden reasoning-budget cap, truncation, routing, or scheduler mechanism behind the model.
  • The report urges OpenAI's Codex team to investigate these thresholds and clarify whether this is expected behavior, a budget constraint, or a sign of degradation.
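
To make the over-representation claim concrete, here is a minimal Python sketch of how such a share comparison could be computed. It is not the report author's method: the records, the model labels, and the helper name share_and_lift are hypothetical and chosen only for illustration. The only figures taken from the summary are the three thresholds and the 82.0% / 19.3% shares.

```python
# Hypothetical records of (model, reasoning_output_tokens).
# These are made-up values for illustration, not data from the report.
records = [
    ("gpt-5.5", 516), ("gpt-5.5", 516), ("gpt-5.5", 1034), ("gpt-5.5", 240),
    ("other", 516), ("other", 300), ("other", 812), ("other", 95),
    ("other", 1290), ("other", 433),
]

# Spike values quoted in the summary above.
THRESHOLDS = (516, 1034, 1552)


def share_and_lift(records, model, value):
    """Return (event_share, overall_share, lift) for one model and one token value.

    event_share:   fraction of exact-`value` responses that came from `model`
    overall_share: fraction of all responses that came from `model`
    lift:          event_share / overall_share (1.0 means no over-representation)
    """
    hit_models = [m for m, tokens in records if tokens == value]
    model_total = sum(1 for m, _ in records if m == model)
    if not hit_models or model_total == 0:
        raise ValueError("need at least one matching event and one response from the model")
    event_share = sum(1 for m in hit_models if m == model) / len(hit_models)
    overall_share = model_total / len(records)
    return event_share, overall_share, event_share / overall_share


event_share, overall_share, lift = share_and_lift(records, "gpt-5.5", 516)
print(f"toy data: event share {event_share:.3f}, overall share {overall_share:.3f}, lift {lift:.2f}")

# The same ratio applied to the shares quoted in the summary (82.0% vs 19.3%).
reported_lift = 0.820 / 0.193
print(f"reported shares imply a lift of about {reported_lift:.2f}")

# Spacing between the quoted thresholds: both gaps are 518 tokens.
gaps = [b - a for a, b in zip(THRESHOLDS, THRESHOLDS[1:])]
print("gaps between thresholds:", gaps)
```

Applied to the quoted shares, the ratio comes out at roughly 4.2, meaning GPT-5.5 appears in exact-516 events about four times as often as its share of traffic alone would predict. The even 518-token spacing between the three quoted thresholds is simple arithmetic on the reported values; what causes it is not established by the summary.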

This detailed analysis raises important questions about the internal mechanisms and consistency of advanced AI models, offering a data-driven perspective on how hidden architectural choices can manifest as performance quirks.

The Gossip

User Experience Echoes

Commenters echo the findings, reporting their own experiences of degraded performance when using GPT-5.5 Codex for reasoning-heavy tasks. One user states flatly that the model is no longer suitable for complex reasoning and notes a significant "delta on intelligence" compared to other models or previous versions, an anecdote consistent with the article's hypothesis of performance issues.