Blocking Internet Archive Won't Stop AI, but Will Erase Web's Historical Record

Major news publishers are increasingly implementing aggressive technical measures to block the Internet Archive from crawling and preserving their websites. While publishers like The New York Times state these actions are a response to concerns about AI companies scraping content for model training, the Electronic Frontier Foundation (EFF) argues that this approach is misguided and has severe, unintended consequences for digital history.

The Internet Archive, operating the Wayback Machine, is the world's largest digital library, dedicated to preserving the web's historical record since the mid-1990s.
Publishers are using methods that go beyond standard robots.txt directives to prevent archiving, risking the loss of critical historical documentation.
This historical record is vital for journalists, researchers, and courts, and is often the only reliable source for how stories were originally published and subsequently altered.
The EFF contends that blocking a nonprofit like the Internet Archive, which is not building commercial AI systems, is the wrong way to address AI scraping issues.
The EFF emphasizes that archiving material and making it searchable is a well-established fair use, as demonstrated by legal precedents such as the Google Books case.
The article warns that sacrificing the public's access to the historical web record in an attempt to control AI training could lead to an irreversible loss of information.

The article ultimately concludes that while disputes over AI training need resolution, conflating commercial AI activities with essential digital preservation efforts is a dangerous and potentially irreparable error, jeopardizing the future accessibility of our shared online history.
