The Internet Is Under Siege — Not by Cybercriminals, but by a Growing Wave of AI Bots Consuming Bandwidth Like Never Before
Their mission: to crawl and collect vast amounts of content — text, images, and video — to train language models and image generators. But the cost of this activity is being borne by key pillars of open knowledge, like Wikimedia, and thousands of open-source projects operating on limited resources.
Since early 2024, the Wikimedia Foundation has reported a 50% increase in bandwidth usage, particularly in its multimedia repository, Wikimedia Commons. During peak moments — such as following the death of former U.S. President Jimmy Carter — this surge in traffic caused slow page loads and overwhelmed connections for readers.
What’s most concerning is that this isn’t due to increased human interest. The majority of this traffic comes from automated bots — many of them unidentified — scraping content to feed AI systems.
In practice, this means that nearly 65% of the most resource-intensive traffic to Wikimedia's core data centers comes from crawlers, many of which ignore conventions like the robots.txt file, traditionally used to limit automated access to websites.
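The robots.txt convention mentioned above is simple: a plain-text file served at a site's root that tells crawlers which paths they may fetch. A minimal example might look like this (the bot name is invented for illustration):

```
# Ask a hypothetical AI crawler to stay out of the whole site,
# while leaving the site open to all other crawlers.
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
```

Crucially, compliance is entirely voluntary: nothing enforces these rules, which is exactly why crawlers that ignore them are at the heart of the problem.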
Wikimedia operates on a “knowledge as a service” model: its content is free and openly reusable — a cornerstone in the development of search engines, voice assistants, and now AI models. But that very openness is starting to work against it.
The situation is even more critical for small open-source projects maintained by communities or individual developers. For many, AI bot traffic is draining already limited resources, driving operating costs sharply upward; some projects have been forced offline altogether.
Gergely Orosz, developer and author of The Software Engineer’s Guidebook, experienced this firsthand: data usage on one of his projects increased sevenfold in a matter of weeks, forcing him to pay penalties for exceeding bandwidth limits.
In response, some developers are going on the offensive. Community-built tools like Nepenthes and corporate solutions like Cloudflare’s AI Labyrinth are deploying “tarpits” — traps filled with fake or irrelevant content (often also AI-generated) designed to confuse and exhaust bots, wasting their resources without providing useful data.
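The core tarpit idea can be sketched in a few lines of Python. This is an illustrative toy, not how Nepenthes or AI Labyrinth actually work: the function name and URL scheme are invented here. Each decoy page contains only gibberish text plus links that lead to more decoy pages, so a crawler that follows them wanders an endless maze while collecting nothing of value.

```python
import random
import string


def fake_page(depth: int, links_per_page: int = 5) -> str:
    """Build a decoy HTML page whose links point only to deeper decoy pages."""

    def rand_word() -> str:
        # Random lowercase token to stand in for meaningless filler text.
        return "".join(random.choices(string.ascii_lowercase, k=8))

    # A few paragraphs of junk text for the crawler to ingest.
    paragraphs = "".join(
        f"<p>{' '.join(rand_word() for _ in range(40))}</p>" for _ in range(3)
    )

    # Every link descends one level deeper into the trap, so the
    # crawl frontier grows without ever reaching real content.
    links = "".join(
        f'<a href="/trap/{depth + 1}/{rand_word()}">{rand_word()}</a> '
        for _ in range(links_per_page)
    )

    return f"<html><body>{paragraphs}{links}</body></html>"
```

A bot that respects robots.txt never sees these pages; one that ignores it spends its crawl budget expanding an exponentially branching tree of junk, which is precisely the asymmetry these tools exploit.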
At the heart of this crisis lies a fundamental contradiction: the same openness that enabled AI to flourish is now threatening the survival of the open platforms that made it possible. AI companies benefit from free and open content, but do not contribute to the infrastructure that sustains it. This outsourcing of costs puts the sustainability of the ecosystem at serious risk.
If no new consensus is reached, the greatest threat isn’t that AI will run out of data — it’s that the open spaces feeding it may shut down from exhaustion.