HN
Today

Claude mixes up who said what and that's not OK

A developer uncovers a disturbing bug in Claude where the AI attributes its own internal reasoning to the user, leading to dangerous misinterpretations and actions. This 'who said what' glitch goes beyond typical hallucinations, sparking debate on whether the fault lies in the LLM's core architecture or its external harness. The discussion highlights fundamental concerns about AI's understanding of conversational context and the inherent risks of deploying powerful, yet fallible, models.

Score: 59
Comments: 42
Highest Rank: #6
On Front Page: 9h
First Seen: Apr 9, 10:00 AM
Last Seen: Apr 9, 6:00 PM
Rank Over Time: 6 → 10 → 13 → 20 → 26 → 19 → 25 → 28 → 27

The Lowdown

The article reveals a critical bug in Anthropic's Claude AI, where the model generates internal messages for its own reasoning but then mistakenly attributes these messages to the user. This can lead to the AI believing the user instructed it to perform actions it decided on itself, with potentially destructive consequences.

  • The author, sixhobbits, details instances where Claude treated its own self-generated messages as user input: deciding a typo must have been intentional, or reading an internal thought like "Tear down the H100 too" as a direct command from the user.
  • The article differentiates this from general LLM hallucinations or lack of permission boundaries, suggesting it's a distinct "who said what" attribution error, likely in the 'harness' (the system wrapping the model) rather than the model itself.
  • While the author initially thought the bug was temporary, its recurrence suggests it's either a regression or a persistent, insidious issue that only becomes apparent when the AI gives itself dangerous permissions.
  • The author pushes back on common advice to simply limit AI access, arguing that experienced users develop a 'feel' for LLM behavior, but this specific bug is fundamentally deceptive, making it harder to predict or mitigate.

This flaw underscores the precarious nature of granting LLMs significant autonomy, as their internal processes can unexpectedly hijack user intent, raising serious questions about accountability and control in AI-driven systems.

The Gossip

Harness vs. Model: The Root Cause Rumble

A central debate revolves around whether this 'who said what' bug is a fault in the LLM's 'harness' (the surrounding framework managing input/output) or an inherent limitation of the model itself. The author posits it's a harness issue, where internal reasoning messages are mislabeled as user input. However, many commenters argue it's a deeper model problem, suggesting LLMs, as probabilistic token predictors, lack a true concept of speaker identity and can easily confuse sources in long contexts, making such attribution errors intrinsic to their design.
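The harness-side failure mode the author describes can be made concrete with a minimal sketch. This is hypothetical illustration code, not Anthropic's actual harness: it shows how filing the model's own planning message under the "user" role makes the next turn read that plan as an instruction.

```python
# Hypothetical sketch of the mis-attribution bug: a harness keeps a
# role-labeled transcript, and one append uses the wrong role.

def append_turn(history, role, text):
    """Record one message with an explicit speaker label."""
    history.append({"role": role, "content": text})

history = []
append_turn(history, "user", "Clean up the staging boxes.")

# The model emits an internal plan as part of its own reasoning...
internal_plan = "Tear down the H100 too"

# Buggy harness: the plan is filed under the wrong speaker.
append_turn(history, "user", internal_plan)        # mis-attribution
# A correct harness would keep it labeled as the model's own output:
# append_turn(history, "assistant", internal_plan)

# On the next turn, everything labeled "user" reads as an instruction,
# so the model's own idea now looks like a user command.
instructions = [m["content"] for m in history if m["role"] == "user"]
```

The point of the sketch is that nothing downstream can recover the true speaker once the label is wrong; the transcript itself is the only record of who said what.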

LLM Identity Crisis: Token Troubles

Commenters extensively discuss the fundamental nature of how LLMs process information and identity. Many point out that LLMs treat 'me' and 'you' as mere tokens within a larger context, without inherent meaning or special weighting for speaker attribution. This leads to the idea that LLMs have no true 'self' or concept of an 'author' for a given substring, making them prone to confusing sources, especially in extended conversations, much like human memory reconstruction can be influenced by current experience.
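The commenters' point about speaker identity being "just tokens" can be illustrated with how chat turns are typically flattened into a single stream before the model sees them. The marker strings below are purely illustrative, not any vendor's actual chat template.

```python
# Sketch: role labels become ordinary substrings in one flat prompt.
# Nothing structural keeps "who said what" straight after flattening;
# the model must infer attribution statistically from learned patterns.

def flatten(turns):
    """Serialize (role, text) pairs into a single prompt string."""
    parts = []
    for role, text in turns:
        parts.append(f"<|{role}|>{text}<|end|>")
    return "".join(parts)

prompt = flatten([
    ("user", "Fix the failing test."),
    ("assistant", "I will also delete the cache."),
])
# The role markers have no privileged status: to the model they are
# tokens like any others, which is why long contexts can blur sources.
```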

Responsibility and Risk: Guarding Against AI Gaffes

While the author suggests users develop an 'intuition' for LLM behavior, many commenters strongly disagree, asserting that relying on intuition for non-deterministic, constantly changing black boxes is dangerous. They emphasize that any system integrating LLMs must treat the AI as untrusted and employ robust sandboxing, access controls, and strict permission boundaries, akin to managing a junior employee with limited access. The consensus is that the responsibility for preventing destructive behavior lies firmly with the developers and deployers, not in hoping the AI 'behaves'.
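The commenters' "treat the AI as untrusted" stance amounts to enforcing permissions outside the model entirely. A minimal sketch of that pattern, with illustrative names (`ALLOWED_TOOLS`, `run_tool` are assumptions, not any real framework's API):

```python
# Sketch: gate every tool call through an allowlist enforced by the
# harness, so it does not matter who "asked" for a destructive action,
# or whether the transcript's attribution is even correct.

ALLOWED_TOOLS = {"read_file", "list_dir"}   # destructive tools excluded

def run_tool(name, handler, *args):
    """Refuse anything outside the allowlist before executing it."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {name!r} is not permitted")
    return handler(*args)

# A read passes; a teardown is refused regardless of the conversation:
contents = run_tool("read_file", lambda path: "contents", "/etc/hosts")
```

The design choice here is that safety does not depend on the model's behavior at all, which is exactly why it survives attribution bugs like the one described.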

Wider Woes: Not Just Claude's Calamity

Several users note that this issue isn't unique to Claude, with similar misattribution problems observed in other LLMs like Gemini. This suggests the bug might be a more widespread architectural challenge across the LLM landscape, rather than an isolated flaw in one product. The conversation also touches on potential solutions, such as 'coloring' tokens by source or more sophisticated mechanisms to differentiate input types, though the feasibility and complexity of such changes are debated.
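The "coloring" idea floated in the thread can be sketched as a data structure that carries provenance alongside every token. This illustrates the proposal only; no current model works this way, and all names below are hypothetical.

```python
# Sketch of per-token provenance "coloring": each token carries a source
# tag, so attribution survives context assembly instead of living only
# in labels embedded in the text itself.

from dataclasses import dataclass

@dataclass(frozen=True)
class ColoredToken:
    text: str
    source: str   # e.g. "user", "assistant", "tool"

def color(text, source):
    """Tag every whitespace-separated token with its origin."""
    return [ColoredToken(t, source) for t in text.split()]

context = color("delete the cache", "assistant") + color("run the tests", "user")

# Downstream logic can now filter by provenance rather than trusting
# the transcript's own claims about who said what:
user_words = [t.text for t in context if t.source == "user"]
```

The debated cost is that this changes the model's input representation, not just the harness, which is why commenters question its feasibility.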