The Battle Against the Bots


Last week, the Wikimedia Foundation announced that bandwidth used for downloading multimedia content had risen 50% over the past year. However, that increase wasn’t driven by human readers. It came from aggressive bots scraping content for AI companies.

This came to a head in January 2025. Following President Jimmy Carter’s death, AI bots began repeatedly accessing and downloading a 1.5-hour video of Carter’s 1980 presidential debate. At its peak, the spike saturated all of Wikipedia’s bandwidth, resulting in slowdowns for human visitors.

According to the Wikimedia Foundation, 65% of its most expensive traffic comes from bots. As such, it is working on new policies to promote the responsible use of its infrastructure.

This fight isn’t new. Ten years ago, the company Distil reported that 25% of all web traffic came from “bad bots,” while humans made up just 54%. The problem has only grown since then.

Imperva, which purchased Distil in 2019, said bad bots comprised 32% of all internet traffic in 2023. However, that number varied wildly by industry, with gaming sites seeing over 57% of their traffic come from malicious bots.

So, what has changed to make the issue so pressing in 2025? The answer is two-fold.

The Bot Army Keeps Growing

Most webmasters and site owners segregate their traffic into three categories: bad bots, good bots and humans.

Humans are self-explanatory. Those are the human visitors accessing and interacting with the site as intended.

However, the line between good and bad bots is often blurry. Good bots are simply bots that the site owner wants accessing the site. This includes search engine crawlers, bots that check for site issues and anything else the owner sees as beneficial.

Bad bots, on the other hand, are any bots the site owner doesn’t want. They can include content scrapers, bots probing for security vulnerabilities, or even “good bots” that misbehave in harmful ways.

Historically, scraping has been a key focus of this site. For decades, bots have scraped websites and RSS feeds to republish content through copy/paste republication, article spinning and other techniques.

Webmasters have fought back in various ways. Robots.txt, for example, limits access for bots that obey the standard, while companies such as Distil (now Imperva), Cloudflare and CloudFilt offer services that block bad bots outright.
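
To see why robots.txt only works against cooperative bots, it helps to look at how a well-behaved crawler uses it. Below is a minimal Python sketch using the standard library’s urllib.robotparser; the site URL and user-agent string are hypothetical.

```python
from urllib import robotparser

# A compliant crawler reads the site's robots.txt before fetching anything.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # hypothetical site
rp.read()

# The crawler then honors the rules for its user agent on each URL.
if rp.can_fetch("ExampleBot", "https://example.com/some-article"):
    print("Allowed: fetch the page")
else:
    print("Disallowed: skip it")
```

Nothing enforces that check, however. A bot that never requests robots.txt, or reads it and ignores it, faces no technical barrier, which is why blocking services have become a business of their own.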

However, the internet is not the same as it was ten years ago. And, though AI has increased the problem, it’s not the sole driver.

A More Centralized Web

When it comes to bad bots, two things about the web have changed significantly over the past 10-20 years.

The most significant, if gradual, change is that the web has become more centralized. Previously, individuals would launch standalone sites for most projects. Today, they typically start on social media sites or other aggregators.

This makes sense. Those sites have a large built-in audience that can only be accessed from within. But it has turned sites like Reddit, Facebook, TikTok, YouTube, Tumblr, Substack, Medium and others into huge silos of targeted content, which has made bot attacks much more focused.

For a bad bot, scraping individual sites is nowhere near as lucrative as accessing these silos. Increasingly sophisticated bots are targeting them. According to the same Imperva report, more than 60% of bad bots are “evasive bad bots” that show some sophistication.

Since most people don’t host their own sites, they have to trust the companies that host their presence to protect them from bots. The interests generally align, as most large companies don’t want bad bots either. However, sometimes they don’t, as when Reddit signed a massive licensing deal with Google for AI training.

However, large companies often have little motivation to address the issue, even when the interests align. After all, bad bots rarely cause significant harm to their infrastructure, so making a large investment in stopping them doesn’t make financial sense.

And then came AI…

The Rise of the AI Bots

Obviously, generative AI has massively disrupted the internet in countless ways that we are just beginning to understand.

However, from a bot perspective, the impact was relatively simple: It created an army of well-funded bots that often operate in ethically and legally dubious ways.

The gold rush into AI has seen billions of dollars pumped into the key players, upstarts and industry stalwarts alike.

Before AI, the largest and best-funded bots were all from search engines. Though these bots have their detractors, they generally obey the wishes of websites and are widely welcomed because the benefit (appearing in search results) outweighs any cost.

AI bots, on the other hand, offer no real benefit to creators. However, keeping them out is a real challenge, especially since many ignore robots.txt and other industry standards. This has created a cat-and-mouse game for site owners trying to shut these bots out.

AI has tipped the scales of the bot battle. Now, the “bad bots” are better funded and more connected than most sites they access. They can also launch a handful of concentrated attacks and scoop up large amounts of data.

This makes it very difficult to fight back. However, some have still found a way.

Fighting Back

One of the more popular techniques for fighting back against bad bots, in particular those run by AI companies, has been the creation of honeypots.

Honeypots detect bots that ignore user wishes and feed them large amounts of useless data. The goal can be to poison the AI’s training data or simply to overload the bot itself.
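
As a rough illustration of the general pattern, not any specific tool’s implementation, one common honeypot approach is to list a trap URL in robots.txt that no compliant crawler should ever visit, then flag and feed junk to any client that requests it. Here is a minimal Python sketch using Flask; the paths, responses and junk generator are hypothetical.

```python
import random
import string

from flask import Flask, request

app = Flask(__name__)
flagged_ips = set()  # clients caught ignoring robots.txt

@app.route("/robots.txt")
def robots():
    # The /trap/ path is disallowed for everyone, so only bots that
    # ignore robots.txt will ever request it.
    return "User-agent: *\nDisallow: /trap/\n", 200, {"Content-Type": "text/plain"}

@app.route("/trap/<path:anything>")
def trap(anything):
    # Anything that lands here ignored robots.txt: flag it and serve filler.
    flagged_ips.add(request.remote_addr)
    junk = " ".join(
        "".join(random.choices(string.ascii_lowercase, k=8)) for _ in range(500)
    )
    return junk

if __name__ == "__main__":
    app.run()
```

Real-world honeypots go further, hiding trap links inside pages, rotating the trap URLs and generating more convincing decoy text, precisely because the bots adapt.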

However, once again, this creates a cat-and-mouse game where the honeypots must be updated when discovered. While it is a clever solution, it’s likely not feasible in the long term and does little to prevent the scraping of the original content.

A more common solution is to use edge services that detect and block the bots entirely. However, this has the same issue, as the bots will change and work to evade these blocks. Fortunately, this approach does at least reduce scraping of the original content.
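
To give a sense of what the simplest layer of that blocking looks like, here is a sketch of a tiny WSGI middleware in Python that turns away requests whose user-agent strings match a list of AI crawlers. The list is illustrative only, and real edge services rely on far more than user-agent matching, since evasive bots routinely spoof these strings.

```python
# A minimal sketch of user-agent blocking, the simplest layer of what edge
# services do. The bot list is illustrative, not exhaustive or authoritative.
BLOCKED_AGENTS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

def block_ai_bots(app):
    """Wrap a WSGI app and refuse requests from listed crawlers."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if any(bot in user_agent for bot in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Automated crawling is not permitted."]
        return app(environ, start_response)
    return middleware
```

A bot that lies about its user agent sails right past a filter like this, which is why commercial services layer on IP reputation, rate limiting and behavioral analysis.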

Ultimately, this is why so many copyright infringement lawsuits have been filed against AI companies. Most large copyright holders realize there is no technical solution to this issue. No blocking strategy is feasible in the long term, especially on sites they don’t control. Even if one were, it couldn’t address what has already been scraped.

As such, many have turned to the courts.

Bottom Line

In many ways, Wikimedia’s situation is a canary in the coal mine. The foundation is well-funded and has a professional team dedicated to infrastructure. However, its resources still pale in comparison to those of large tech companies.

The foundation’s struggles with AI bots suggest that this issue is growing and likely having a greater impact on the internet’s infrastructure than we realize.

While the debate about AI’s legitimacy rages on, we should also discuss the nuts and bolts of how AI systems scrape content. Legal and ethical issues aside, bad bots can do tangible damage to the sites they access.

It may be a side point, but it is important, especially if we want the internet to remain a useful tool for decades to come.
