
The official websites and source code repositories of open-source and free software projects are typically accessible to the public. However, maintaining such open access requires substantial server infrastructure and bandwidth. Under normal circumstances, genuine user visits exert minimal pressure on these systems.
An administrator of the renowned GNOME desktop environment recently shared traffic analysis data revealing a troubling trend: within a span of just 2.5 hours, GNOME's servers received over 81,000 requests, yet only 3% of them successfully completed Anubis's proof-of-work challenge. This suggests that the remaining 97% were generated by bots rather than human users.
These bots, particularly those operated by AI companies, often disregard the conventions set by the robots.txt protocol. Armed with vast pools of IP addresses, they bombard open-source project websites with concurrent requests in pursuit of data for model training, parasitically exploiting public resources.
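For context, robots.txt is a purely voluntary convention: a polite crawler fetches it and skips any paths the site owner has disallowed, but nothing technically enforces compliance. A minimal Python sketch of the convention, using the standard library's robotparser and an illustrative rule set, looks like this:

```python
from urllib import robotparser

# Illustrative robots.txt rules: the site asks GPTBot to stay away entirely.
rules = """\
User-agent: GPTBot
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A well-behaved crawler checks before fetching; compliance is voluntary.
print(rp.can_fetch("GPTBot", "https://example.org/docs/"))        # False: asked to stay out
print(rp.can_fetch("SomeOtherBot", "https://example.org/docs/"))  # True: no rule applies
```

The crawlers described above simply skip this check or ignore its answer.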
To mitigate the relentless drain on hardware and network bandwidth, GNOME has resorted to deploying Anubis, a proof-of-work system designed to fend off AI scrapers. However, the challenge occasionally trips up legitimate visitors as well, inadvertently obstructing real human traffic.
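The general idea behind such a proof-of-work gate is to make each page load slightly expensive for the client: the visitor must find a nonce whose hash meets a difficulty target, which is negligible for a single human visit but adds up for a scraper issuing tens of thousands of requests. The sketch below shows the general technique in Python; it is not Anubis's actual implementation, and the difficulty value is illustrative.

```python
import hashlib
import os

DIFFICULTY = 18  # required leading zero bits; illustrative, not Anubis's setting

def solve(seed: bytes) -> int:
    """Client side: brute-force a nonce until sha256(seed + nonce) has
    DIFFICULTY leading zero bits. Cheap once per visit, costly at bot scale."""
    nonce = 0
    while True:
        digest = hashlib.sha256(seed + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0:
            return nonce
        nonce += 1

def verify(seed: bytes, nonce: int) -> bool:
    """Server side: verification costs a single hash, regardless of difficulty."""
    digest = hashlib.sha256(seed + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0

seed = os.urandom(16)             # server issues a fresh random challenge per visitor
print(verify(seed, solve(seed)))  # True
```

The asymmetry is the point: solving takes many hash attempts, while verifying takes exactly one, so the cost lands on the requester rather than the server.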
GNOME is far from alone in this predicament. Other projects such as KDE, Fedora, LWN, and Frame Software are grappling with the same onslaught: the vast majority of their web traffic now originates from voracious AI crawlers, in a phenomenon that resembles a distributed denial-of-service (DDoS) attack in both scale and impact.
There exists no definitive solution to this emerging threat. Administrators are being forced to expend considerable time, money, and technical resources in a losing battle against these insatiable AI engines—companies ruthlessly harvesting data to feed their training algorithms.
Instances abound: OpenAI's ChatGPT and ByteDance's Bytespider have previously been implicated in overwhelming websites with high-frequency scraping, pushing them to the brink of collapse. These bots often ignore robots.txt or flood servers with massive concurrent requests, rendering the infrastructure nearly inoperable.
For the AI companies, such tactics bear no cost. But for the targeted websites, the consequences are dire—squandered server resources, increased operational complexity, and the burden of developing detection and mitigation strategies. In the end, it is the open-source community that bears the brunt of the damage.
While some well-known crawlers, like GPTBot, can still be blocked based on their user-agent strings, a significant portion of these bots operate covertly—masquerading as mobile users and concealing their identities. In such cases, relying on user-agent detection becomes an arduous and often ineffective endeavor.
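As an illustration of why user-agent filtering only goes so far, the sketch below (with made-up header strings) blocks requests that honestly declare a known crawler, but waves through any bot that simply presents a browser-like User-Agent:

```python
# Illustrative server-side filtering on the User-Agent header. The denylist
# tokens are the real crawler names mentioned above; the header strings are
# hypothetical examples.
DENYLIST = ("GPTBot", "Bytespider")

def is_declared_crawler(user_agent: str) -> bool:
    """Return True if the request identifies itself as a known crawler."""
    return any(token.lower() in user_agent.lower() for token in DENYLIST)

honest_bot = "Mozilla/5.0 AppleWebKit/537.36; compatible; GPTBot/1.0"
disguised_bot = ("Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
                 "AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148")

print(is_declared_crawler(honest_bot))     # True  -> can be blocked by name
print(is_declared_crawler(disguised_bot))  # False -> slips through as a "mobile user"
```

Catching the disguised traffic instead requires behavioral analysis or challenges like the proof-of-work gate above, which is exactly the extra operational burden these projects now carry.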