HN Today

Internet Increasingly Becoming Unarchivable

News publishers, including The Guardian and The New York Times, are increasingly blocking the Internet Archive from accessing their content. The blocks stem from concerns that AI companies are scraping publishers' intellectual property for model training via the Archive's collections and APIs. The move has ignited a heated Hacker News debate pitting the preservation of the historical digital record against publishers' right to protect their content and business models from AI exploitation.

Score: 75
Comments: 40
Highest Rank: #1
Time on Front Page: 4h
First Seen: Feb 14, 7:00 PM
Last Seen: Feb 14, 10:00 PM

The Lowdown

News publishers, including giants like The Guardian and The New York Times, are increasingly restricting the Internet Archive's access to their content. This move is a direct response to fears that AI companies are leveraging the Archive's vast digital repository to scrape content for model training, raising questions about copyright, compensation, and the future of web preservation.

  • Publishers are blocking the Internet Archive via robots.txt rules and by directly limiting its API access (see the sketch after this list).
  • The Guardian, while supporting the Archive's mission, cited compliance and "backdoor threat" concerns regarding AI scraping.
  • The New York Times is "hard blocking" the Archive's bot, emphasizing the value of human-led journalism and lawful IP use.
  • Reddit previously blocked the Archive and has since licensed its data directly to AI companies.
  • Internet Archive founder Brewster Kahle warns that limiting libraries restricts public access to the historical record, potentially fostering "information disorder."
  • Evidence shows the Wayback Machine has been a significant source for AI training datasets (e.g., Google's C4 dataset).
  • An AI company caused a temporary server overload at the Internet Archive by aggressively scraping public domain archives, highlighting the intensity of AI data demands.
  • A survey of 1,167 news websites showed 241 disallowing at least one Internet Archive bot, with Gannett-owned sites making up a large portion.
  • Many publishers also block other non-profit crawlers like Common Crawl, alongside commercial AI bots.
  • Paradoxically, news organizations often lack robust internal archiving, making the Internet Archive crucial for preserving their own work.
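
For context on the robots.txt mechanism in the first bullet: robots.txt is advisory, so a block only takes effect if a crawler honors it. Below is a minimal Python sketch using the standard library's urllib.robotparser with a hypothetical rule set that disallows "ia_archiver", a user-agent token historically associated with the Internet Archive's crawler; the exact tokens publishers target, and the example.com URLs, are assumptions for illustration.

    import urllib.robotparser

    # Hypothetical robots.txt of the kind publishers are reportedly deploying:
    # the Archive's crawler is singled out while other agents remain allowed.
    # "ia_archiver" is a token historically associated with the Internet
    # Archive; real files may target other tokens (e.g. archive.org_bot).
    rules = """\
    User-agent: ia_archiver
    Disallow: /

    User-agent: *
    Allow: /
    """.splitlines()

    parser = urllib.robotparser.RobotFileParser()
    parser.parse([line.strip() for line in rules])

    # A compliant crawler checks the rules before fetching a page.
    print(parser.can_fetch("ia_archiver", "https://example.com/news/story"))   # False
    print(parser.can_fetch("SomeOtherBot", "https://example.com/news/story"))  # True

Because compliance is voluntary, rules like these stop well-behaved crawlers such as the Archive's while doing nothing against scrapers that ignore robots.txt, which is part of the collateral-damage argument raised in the comments.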

This trend highlights a growing conflict between the mission of digital preservation and publishers' efforts to protect their intellectual property and revenue streams from AI exploitation. The implications for historical research, legal compliance, and public access to a comprehensive digital record are profound, raising concerns about the long-term integrity of the online historical record.

The Gossip

Archival Anxieties: Preserving the Past

Many commenters express dismay at the internet becoming less archivable, emphasizing the importance of a comprehensive historical record for learning, legal compliance, and public discourse. They argue that if content is public, it should inherently be archivable, and fear a future where history can be erased or manipulated. The practical implications for audit trails and proving past compliance are also highlighted.

AI's Archival Assault: A Collateral Conundrum

Commenters acknowledge publishers' concerns about AI scraping, viewing the Internet Archive as collateral damage in the struggle against unauthorized AI training. They discuss publishers' right to protect their intellectual property and business models from being undermined by AI. Some suggest that blocking the Archive is a tactic to extract more revenue or that AI companies will simply find other, less scrupulous ways to obtain data.

The 'Right to be Forgotten' vs. Public Record

A philosophical debate emerges about whether everything on the internet *should* be archived indefinitely. Some argue for a 'right to be forgotten,' suggesting that not all content is valuable enough to preserve, and permanent records could be used against individuals. Conversely, others strongly push back, asserting that the importance of a comprehensive public record outweighs these concerns, suggesting that individuals should simply avoid engaging in behavior they wouldn't want permanently recorded.

Fading Quality & Future Archiving Solutions

Some commenters lament the perceived decline in web content quality, particularly with the rise of AI-generated "slop," and question the value of archiving such material. Discussion also turns to alternative archiving approaches, such as distributed crawling networks (a SETI@home-style model applied to web archiving) or private, access-restricted archives for academic and journalistic research.