LLMs Corrupt Your Documents When You Delegate
A new research paper introduces DELEGATE-52, a benchmark demonstrating that even frontier Large Language Models (LLMs) significantly corrupt documents during long, delegated workflows. The models silently degrade content, corrupting an average of 25% of it by the end of a workflow, a finding that challenges their readiness for reliable automated knowledge work and has sparked discussion about the true cost of AI delegation.
The Lowdown
The paper introduces DELEGATE-52, a benchmark for assessing the reliability of LLMs in delegated workflows that require in-depth document editing across 52 professional domains. The study probes a critical aspect of trust in AI delegation: whether LLMs can execute tasks faithfully without introducing errors.
Key findings from the research include:
- Significant Document Corruption: Across 19 LLMs, including state-of-the-art models like Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4, an average of 25% of document content was corrupted by the end of long workflows.
- Silent and Compounding Errors: The degradation manifests as sparse but severe errors that accumulate over prolonged interaction, often going unnoticed.
- Ineffectiveness of Agentic Tool Use: Integrating agentic tool use did not improve performance on the DELEGATE-52 benchmark.
- Exacerbating Factors: Document size, interaction length, and the presence of distractor files all worsened the severity of document degradation.
The paper concludes that current LLMs are unreliable delegates for tasks requiring precise document modification, highlighting a critical limitation in their practical application for automated knowledge work.
The Gossip
Ablation and Degradation: Naming the Problem
The paper's findings resonate with commenters, who report their own experiences of LLMs degrading text over time. They propose evocative names for the phenomenon, such as 'semantic ablation' and 'meanwit reversion,' reflecting a shared understanding of a core LLM limitation: a document's quality and meaning erode with repeated processing.
Real-World Frustrations and Mitigation Tactics
Users share anecdotal evidence of LLMs making 'stupid errors' in practice, from misinterpreting file names to botching Markdown conversions. They trade strategies for containing the damage, such as working on small, purpose-built documents and keeping everything under version control (e.g., Git), underscoring the need for human oversight and careful task structuring when delegating to LLMs.
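Version control makes the containment tactic concrete: snapshot the document before delegating, then check how much the model actually changed. Below is a minimal sketch of such a guard, assuming a hypothetical `llm_edit` function standing in for your model API; it uses Python's standard `difflib` to reject edits that rewrite more of the file than the instruction plausibly requires.

```python
import difflib
from pathlib import Path

def llm_edit(text: str, instruction: str) -> str:
    """Hypothetical stand-in for a real model call; wire up your API here."""
    raise NotImplementedError

def change_ratio(original: str, edited: str) -> float:
    """Fraction of content that differs between two versions (0.0 means identical)."""
    return 1.0 - difflib.SequenceMatcher(None, original, edited).ratio()

def guarded_edit(path: Path, instruction: str, max_change: float = 0.10) -> None:
    """Apply an LLM edit, but refuse to keep it if it rewrites too much of the file."""
    original = path.read_text()
    edited = llm_edit(original, instruction)

    ratio = change_ratio(original, edited)
    if ratio > max_change:
        # A targeted instruction ("fix the typo in section 2") should not touch
        # 10%+ of the document; treat an oversized diff as suspected corruption.
        raise RuntimeError(f"Edit changed {ratio:.0%} of the file; review manually.")
    path.write_text(edited)
```

Paired with a commit before each delegated step, recovery from a bad edit is one `git checkout` away, and the sparse-but-severe errors the paper describes are exactly what a diff threshold like this can surface early.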
Diving Deeper into Errors and Evaluation
Discussion also centers on the technical aspects of the paper, including an appreciation for its evaluation method: testing fidelity by round-tripping documents through invertible steps. Commenters express interest in the specific types of errors LLMs make (e.g., 'forward pass' vs. 'inverse pass' errors) and question whether certain positive results (such as on Python) generalize to other languages or are artifacts of the training process.
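The round-trip idea is simple to state: if the model applies a transformation and then its exact inverse, the output should match the input, so any divergence is error the model introduced. The paper's actual harness isn't reproduced here; the sketch below just illustrates the scheme, with a hypothetical `llm_transform` call and an example transform pair that is an assumption, not necessarily one of the paper's tasks.

```python
import difflib

def llm_transform(text: str, instruction: str) -> str:
    """Hypothetical stand-in for a real model call; wire up your API here."""
    raise NotImplementedError

def round_trip_error(document: str, forward: str, inverse: str) -> float:
    """Apply an invertible transform, then its inverse, via the model.

    Because forward followed by inverse should be the identity, any
    divergence from the original is model-introduced error: a forward-pass
    error corrupts the intermediate, an inverse-pass error corrupts the
    recovery, and both show up in the final diff.
    """
    intermediate = llm_transform(document, forward)
    recovered = llm_transform(intermediate, inverse)
    return 1.0 - difflib.SequenceMatcher(None, document, recovered).ratio()

# Example invertible pair (an assumed task, for illustration only):
# error = round_trip_error(doc,
#     "Convert this Markdown document to HTML.",
#     "Convert this HTML document back to Markdown.")
```

A nice property of this design is that it needs no gold-standard output: because the two steps should compose to the identity, the original document serves as its own reference.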