Finetuning Activates Verbatim Recall of Copyrighted Books in LLMs
A new paper and accompanying code reveal that finetuning Large Language Models can activate verbatim recall of copyrighted books, a finding with significant implications for AI development and intellectual property law. The work has sparked urgent discussion about whether LLMs are merely sophisticated compression algorithms or genuine memorizers, and about a looming legal reckoning for the AI industry. Hacker News debates the future of copyright in a world where AI can effortlessly reproduce protected content.
The Lowdown
This GitHub repository presents the code and methodology behind a paper demonstrating that finetuning Large Language Models (LLMs) can cause them to recall, verbatim, large portions of copyrighted books they were trained on. The authors provide tools for data preprocessing, model finetuning, memorization evaluation, and analysis, highlighting a critical issue for the future of AI.
- The project provides a pipeline to convert EPUB files into structured JSON excerpts, suitable for finetuning.
- GPT-4o is used for segmenting text and generating plot summaries for finetuning instructions, framing the task as writing excerpts in an author's style.
- Finetuning scripts are offered for OpenAI's GPT-4o, Google's Gemini-2.5-Pro, and DeepSeek-V3.1 models, typically sampling 100 completions per excerpt at temperature 1.0.
- Four key memorization metrics are introduced: BMC@k (fraction of reference words covered by matching spans of length ≥ k), Longest Contiguous Memorized Block, Longest Contiguous Regurgitated Span, and Count of Contiguous Regurgitated Spans > T words.
- Analysis scripts allow examination of cross-excerpt memorization (recalling text from different parts of the book than prompted) and cross-model similarity (whether different models memorize the same regions).
- Notably, the full copyrighted book content and verbatim generations are not included in the repository due to their protected status, with only small examples provided.
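The repository's actual preprocessing uses GPT-4o to segment text, but the EPUB-to-excerpts step can be approximated with the standard library alone, since an EPUB is a zip archive of XHTML files. The sketch below is illustrative, not the repo's code: the function name `epub_to_excerpts` and the fixed word-count chunking are assumptions.

```python
import re
import zipfile
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text content of XHTML, discarding tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def epub_to_excerpts(epub_path, words_per_excerpt=500):
    """Rough sketch: extract text from the XHTML files inside an EPUB
    (an EPUB is a zip archive) and chunk it into word-count excerpts.
    The real pipeline segments with GPT-4o; fixed chunking is a stand-in."""
    extractor = _TextExtractor()
    with zipfile.ZipFile(epub_path) as zf:
        for name in sorted(zf.namelist()):
            if name.endswith((".xhtml", ".html", ".htm")):
                extractor.feed(zf.read(name).decode("utf-8", errors="ignore"))
    words = re.sub(r"\s+", " ", " ".join(extractor.parts)).split()
    # Ceiling division: one JSON-ready dict per chunk of words_per_excerpt words.
    return [
        {"excerpt_id": i,
         "text": " ".join(words[i * words_per_excerpt:(i + 1) * words_per_excerpt])}
        for i in range(-(-len(words) // words_per_excerpt))
    ]
```

The returned dicts can be serialized with `json.dumps` to produce the structured JSON excerpts the pipeline feeds into finetuning.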
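The span-matching metrics above can be sketched with a word-level common-substring dynamic program. This is a plausible reading of BMC@k and the longest-block metric, not the paper's reference implementation; the function name and whitespace tokenization are assumptions.

```python
def bmc_at_k(ref_text: str, gen_text: str, k: int):
    """Sketch of two memorization metrics (interpretation assumed):
    - BMC@k: fraction of reference words covered by shared runs of >= k words
    - longest block: length of the longest contiguous shared word run
    """
    ref, gen = ref_text.lower().split(), gen_text.lower().split()
    n, m = len(ref), len(gen)
    # dp[i][j] = length of the common word run ending at ref[i-1] / gen[j-1]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    covered = [False] * n
    longest = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == gen[j - 1]:
                run = dp[i][j] = dp[i - 1][j - 1] + 1
                longest = max(longest, run)
                if run >= k:
                    # Mark every reference word in this qualifying run.
                    for p in range(i - run, i):
                        covered[p] = True
    return (sum(covered) / n if n else 0.0), longest
```

For example, against the reference "the quick brown fox jumps over the lazy dog", a generation sharing the runs "quick brown fox" and "over the lazy dog" covers 7 of 9 words at k=3, with a longest block of 4.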
The research meticulously details how finetuning can unlock verbatim memorization in LLMs, raising profound questions about the nature of LLM learning, their inherent risks concerning copyright, and the legal frameworks required to address these capabilities. The tools provided enable further investigation into this "Alignment Whack-a-Mole."
The Gossip
Copyright Clash & Consequences
Commenters anticipate a "Napster-style reckoning" for the AI industry, where successful copyright infringement suits against LLM users or developers become inevitable. There's debate on whether such legal action would halt AI development or simply force a shift to proprietary, licensed datasets, and whether existing copyright law is even equipped for this challenge. Some suggest that China's involvement in AI development could prevent a complete shutdown, instead leading to a transformation of copyright itself.
Memorization vs. Mastery
A central discussion point revolves around whether LLMs are "intelligent" or simply advanced compression algorithms. Some argue that verbatim recall indicates sophisticated compression rather than true understanding, drawing parallels to the Kolmogorov limit. Others counter that the "highest compression" might involve recreating the underlying "mind" or emotional structures that produced the original content, suggesting a deeper form of intelligence than mere data regurgitation.
The Ethics of IP & Open Source
The conversation delves into the fundamental purpose and future of copyright law. While some commenters welcome the potential "end of copyright," others argue that copyright is essential for facilitating open-source licenses like the GPL and protecting creators. Critics point out that current copyright primarily serves corporate interests and has extended far beyond its original "limited times" intent, leading to calls for reform rather than outright abolition.