AI Crawlers Have Led to a 50% Increase in Wikimedia Commons’ Bandwidth Usage

The Wikimedia Foundation, which oversees Wikipedia and several other crowdsourced knowledge projects, reported on Wednesday that multimedia download bandwidth from Wikimedia Commons has risen by 50% since January 2024.

According to a blog post published Tuesday, this surge isn’t driven by human users but by automated scrapers gathering data to train AI models.

“Our infrastructure is designed to handle sudden spikes in human traffic during major events, but the scale of traffic from scraper bots is unprecedented, posing increasing risks and costs,” the post stated.

Wikimedia Commons serves as an open-access repository for images, videos, and audio files, all available under open licenses or as public domain content.

Bots Drive Majority of Resource-Heavy Traffic on Wikimedia

Wikimedia reports that nearly two-thirds (65%) of its most resource-intensive traffic—content that requires the most processing power to serve—comes from bots. However, these bots account for only 35% of total pageviews. This imbalance occurs because frequently accessed content remains cached closer to users, whereas less popular content is stored in the core data center, making it more expensive to retrieve. Bots typically target this less-accessed content, increasing resource demands.
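The economics described above come down to cache locality. The toy Python sketch below is not Wikimedia's actual setup; the cache size, catalogue size, and traffic distributions are invented purely to illustrate why a crawler that sweeps the long tail misses the cache far more often than human readers who cluster on popular pages:

```python
import random

# Illustrative model only: a small cache of the most popular pages,
# fronting a costly core datacenter that serves everything else.
CACHE_SIZE = 100        # only the top-100 pages fit in the cache
NUM_PAGES = 10_000      # total catalogue

def cache_miss_rate(requests):
    """Fraction of requests that fall through to the core datacenter."""
    misses = sum(1 for page in requests if page >= CACHE_SIZE)
    return misses / len(requests)

random.seed(42)

# Human readers cluster on popular topics: heavily skewed toward low page IDs.
humans = [min(int(random.expovariate(1 / 30)), NUM_PAGES - 1)
          for _ in range(10_000)]

# Crawlers "bulk read": they sweep the catalogue uniformly, tail included.
bots = [random.randrange(NUM_PAGES) for _ in range(10_000)]

print(f"human cache-miss rate: {cache_miss_rate(humans):.0%}")
print(f"bot cache-miss rate:   {cache_miss_rate(bots):.0%}")
```

Under these made-up distributions, only a few percent of human requests miss the cache, while nearly every bot request does, which is the imbalance Wikimedia describes: bots are a minority of pageviews but a majority of expensive traffic.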

“While human readers tend to focus on specific – often similar – topics, crawler bots tend to ‘bulk read’ larger numbers of pages and visit also the less popular pages,” Wikimedia explains. “This means these types of requests are more likely to get forwarded to the core datacenter, which makes it much more expensive in terms of consumption of our resources.”

As a result, Wikimedia’s site reliability team must invest significant time and resources in blocking crawlers to prevent disruptions for regular users—on top of the growing cloud costs the Foundation faces.

AI Crawlers Intensify Threats to the Open Internet

More broadly, this highlights a troubling trend threatening the open internet. Last month, software engineer and open-source advocate Drew DeVault criticized AI crawlers for disregarding “robots.txt” files meant to block automated access. Similarly, tech writer Gergely Orosz noted that AI scrapers from companies like Meta have significantly increased bandwidth demands for his projects.
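For context, robots.txt is a plain-text file served at a site's root that asks crawlers to stay out of some or all paths. A minimal sketch is below; GPTBot and CCBot are user-agent strings published by OpenAI and Common Crawl respectively, but, as DeVault's complaint underscores, compliance is entirely voluntary:

```
# robots.txt — a request, not an enforcement mechanism
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# All other crawlers may visit everything
User-agent: *
Disallow:
```

A crawler that ignores this file faces no technical barrier, which is why operators like Wikimedia fall back on active blocking.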

Open-source infrastructure is a prime target, but developers are pushing back with both ingenuity and determination, as TechCrunch reported last week. Some tech companies are also stepping in to tackle the problem—Cloudflare, for instance, recently introduced AI Labyrinth, a tool that generates AI-created content to hinder web crawlers.

Still, this remains a constant game of cat and mouse, one that could eventually drive many publishers to hide their content behind logins and paywalls—ultimately making the internet less accessible for everyone.


Read the original article on: TechCrunch

Read more: Researchers Suggest OpenAI Trained its Models on Paywalled O’Reilly Books
