Reddit has announced a significant restriction on the Internet Archive’s ability to index its content, citing concerns that artificial intelligence (AI) companies are scraping data from the Wayback Machine in violation of platform policies. Reddit says the move is intended to protect user privacy and give it more control over access to its extensive user-generated content.
Effective immediately, the Internet Archive’s Wayback Machine will no longer be permitted to crawl most of Reddit’s site, including post detail pages, individual comments, and user profiles. As a result, the Wayback Machine will only be able to index Reddit’s main homepage, offering a severely limited snapshot of trending topics rather than a comprehensive historical archive.
A spokesperson for Reddit, Tim Rathschmidt, explained the decision, stating, “Internet Archive provides a service to the open web, but we’ve been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine.” He further emphasized Reddit’s commitment to safeguarding its users: “Until they’re able to defend their site and comply with platform policies (e.g., respecting user privacy, re: deleting removed content) we’re limiting some of their access to Reddit data to protect redditors.”
The Internet Archive’s core mission revolves around creating a digital repository of websites and cultural artifacts, with the Wayback Machine serving as its primary tool for historical web page viewing. However, Reddit contends that not all of its content should be archived in a manner that facilitates policy violations, particularly regarding user privacy and the retention of content that users may have since removed.
Reddit confirmed that it informed the Internet Archive “in advance” of these access limits, which began “ramping up” today. Rathschmidt also noted that Reddit has “raised concerns” in the past regarding the ease with which content can be scraped from the Internet Archive.
This action aligns with Reddit’s recent history of tightening control over its data in response to the proliferation of AI training models. The platform has previously taken steps to block widespread scraping, offering data access only to companies willing to pay. Notably, Reddit struck a major deal with Google early last year for both search and AI training data. This was followed by a policy shift blocking major search engines from crawling its data without a commercial agreement. Furthermore, Reddit attributed its controversial API changes in 2023, which led to widespread protests and the shutdown of many third-party apps, to the misuse of those APIs for AI model training.
Reddit has forged an AI deal with OpenAI, but it also sued Anthropic in June, alleging that the company continued scraping Reddit despite its assurances to the contrary.
Mark Graham, director of the Wayback Machine, provided a statement to The Verge regarding the situation: “We have a longstanding relationship with Reddit and continue to have ongoing discussions about this matter.”
Update, August 11th: Added statement from the Wayback Machine.