Blocking Internet Archive Won't Stop AI, but Will Erase Web's Historical Record
Major publishers, including The New York Times, are blocking the Internet Archive from crawling their sites, citing concerns over AI content scraping. This decision, however, inadvertently threatens to erase decades of invaluable digital historical records relied upon by journalists and researchers worldwide. The situation sparks concern on HN as it highlights the profound, unintended consequences of the AI boom on digital preservation and the fundamental principles of fair use.
The Lowdown
Major news publishers are increasingly implementing aggressive technical measures to block the Internet Archive from crawling and preserving their websites. While publishers like The New York Times state these actions are a response to concerns about AI companies scraping content for model training, the Electronic Frontier Foundation (EFF) argues that this approach is misguided and has severe, unintended consequences for digital history.
- The Internet Archive, operating the Wayback Machine, is the world's largest digital library, dedicated to preserving the web's historical record since the mid-1990s.
- Publishers are using methods beyond standard robots.txt to prevent archiving, risking the loss of critical historical documentation.
- This historical record is vital for journalists, researchers, and courts, often being the only reliable source for how stories were originally published and subsequently altered.
- The EFF contends that blocking a nonprofit like the Internet Archive, which is not building commercial AI systems, is the wrong way to address AI scraping issues.
- The EFF emphasizes that archiving and making material searchable is a well-established fair use principle, as demonstrated by legal precedents like the Google Books case.
- The article warns that sacrificing the public's access to the historical web record in an attempt to control AI training could lead to an irreversible loss of information.
The article ultimately concludes that while disputes over AI training need resolution, conflating commercial AI activities with essential digital preservation efforts is a dangerous and potentially irreparable error, jeopardizing the future accessibility of our shared online history.