Blocking Bad Bots with Robots.txt

When it comes to things crawling your site, there are good bots and bad bots. Good bots, like Google’s spider, crawl your site to index it for search engines or provide some other symbiotic use. Others spider your site for more nefarious reasons such as stripping out your content for republishing, downloading whole archives of your site or extracting your images.

The question is simple. How do you block the bad bots while welcoming the good ones in? Fortunately there is a standard, robots.txt, that can do just that.

However, editing your robots.txt file can be a daunting task, even for experienced Webmasters, as it requires mastering a very specific and somewhat unusual format. To make matters worse, even those who understand how such files work rarely know the names of all the bots to let in or block, limiting what they can do with it.

Fortunately, there is both a WordPress plugin and a robots.txt generator that can help. What is unclear, however, is whether they will actually work.

A Quick Word About Robots.txt

The basic idea behind robots.txt is that it is meant to be a guide for robots (or spiders, as they are often called) when they visit your site. It tells bots which pages and directories they can and cannot visit, and it can set up either blanket instructions for all bots or instructions for specific bots.

This is done by, appropriately, including a file titled robots.txt in the root of your server. For example, Plagiarism Today’s (deliberately open) robots.txt file can be found at https://www.plagiarismtoday.com/robots.txt.
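To make the format a little more concrete, here is a minimal sketch that uses Python's standard urllib.robotparser module to read a small set of rules and answer the question a well-behaved bot asks before each request. The rules, the /private/ directory and the bot name ExampleScraper are all made up for illustration; they are not taken from any real site or blocklist.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one blanket group for every bot ("*") and one
# group aimed at a single bot, identified by its user-agent name.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleScraper
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if the rules permit this bot to request this path."""
    return parser.can_fetch(agent, path)


print(allowed("Googlebot", "/articles/"))       # True: only /private/ is off limits
print(allowed("Googlebot", "/private/notes"))   # False: blanket rule applies
print(allowed("ExampleScraper", "/articles/"))  # False: this bot is barred entirely
```

Each group starts with a User-agent line naming who it applies to, followed by the paths that bot is asked to stay out of. A bot with its own group follows that group; everyone else falls back to the "*" group.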

Since robots.txt makes it possible to filter bots by their identifier, it can be used to let only certain spiders into your site while keeping others out. The problem is that there are, quite literally, hundreds of different spiders out there and new ones being written all the time. It is almost impossible to keep on top of which spiders you should allow and which you should banish.

Fortunately, there are two tools that may be able to help you do exactly that by giving you the information you need to build a robots.txt file that’s capable of keeping at least some of the bad bots at bay.

Building a Better Robots.txt

To help you build an effective robots.txt file, there are two tools you may wish to look at:

  1. Robots.txt WordPress Plugin: Written by Peter Coughlin, this WordPress plugin largely automates the process of building a robots.txt file that can keep bad bots out while letting in the good ones. Though largely hands-free, you can tweak your own robots.txt file through this plugin, eliminating the need to access it via FTP. Also, the plugin is careful not to overwrite any existing file and will uninstall gracefully if requested.
  2. Robots.txt Builder: Provided by David Naylor, this robots.txt builder lets you select from categories of bots, including search engines, archivers and, of course, “bad robots” and decide the level of access each group has. When done, simply paste the results from the builder into your robots.txt file and upload it to your server.

The two tools are related, as Coughlin’s WordPress plugin uses Naylor’s list of bad bots to help it determine which bots to keep out. So, in that regard, the WordPress plugin is essentially an automated install of Naylor’s list for WordPress users.

This also means that both methods function largely the same way, by compiling a list of bad bots and, using robots.txt, telling them to keep out while throwing the doors wide open to other kinds of spiders that may wish to crawl your site.
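The general approach described above can be sketched in a few lines of Python. This is not the code behind either tool, just an illustration of the idea: take a list of bot names, write a "keep out" group for each and finish with a group that admits everyone else. The function name build_robots_txt and the bot names are hypothetical placeholders, not entries from Naylor's list.

```python
def build_robots_txt(bad_bots):
    """Return robots.txt text that bars each named bot and admits all others."""
    groups = []
    for name in bad_bots:
        # "Disallow: /" asks the named bot to stay out of the whole site.
        groups.append(f"User-agent: {name}\nDisallow: /\n")
    # An empty Disallow means nothing is off limits for the remaining bots.
    groups.append("User-agent: *\nDisallow:\n")
    # Groups are separated by a blank line.
    return "\n".join(groups)


# Placeholder names for illustration only.
BAD_BOTS = ["ExampleScraper", "ExampleImageGrabber"]

robots_txt = build_robots_txt(BAD_BOTS)
print(robots_txt)
```

The printed text is what you would save as robots.txt in the root of your site. The longer the list of names, the more bots are asked to stay away, which is why a maintained list is the valuable part of either tool.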

However, it is unclear exactly how effective this method is and the reason is the nature of the bad bots themselves.

Limitations and Concerns

The problem with using robots.txt in this way is that, for it to work, it requires the cooperation of the robots themselves. Robots.txt doesn’t do anything to actually restrict bots from accessing various parts of your site; it merely tells them where you do and do not wish them to go. In short, if a bot wants to ignore your robots.txt guidelines, it can.

Legitimate bots obey robots.txt for a variety of legal and ethical reasons. Googlebot, for example, will always adhere to your robots.txt rules. Bad bots, however, are free to ignore them and often do.
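The difference between the two kinds of bot comes down to whether the bot's own code bothers to check the file. The sketch below, again using Python's urllib.robotparser, contrasts a hypothetical rule-abiding bot with one that skips the check. The function names, paths and the bot name ExampleScraper are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: bar one named bot, admit everyone else.
ROBOTS_TXT = """\
User-agent: ExampleScraper
Disallow: /

User-agent: *
Disallow:
"""


def polite_fetch_plan(agent, paths, robots_txt):
    """Return only the paths a rule-abiding bot would go on to request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if parser.can_fetch(agent, path)]


def rude_fetch_plan(agent, paths, robots_txt):
    """A bot that ignores robots.txt never consults it: every path is requested."""
    return list(paths)


paths = ["/", "/archive/", "/images/photo.jpg"]
print(polite_fetch_plan("ExampleScraper", paths, ROBOTS_TXT))  # []
print(rude_fetch_plan("ExampleScraper", paths, ROBOTS_TXT))    # all three paths
```

Nothing on the server side changes between the two cases. The file is the same; only the bot's willingness to read and honor it differs.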

This doesn’t mean using robots.txt to block bad bots is completely ineffective. The reason is that many “bad” bots are actually good or neutral ones used for bad reasons. For example, there are many download spiders that can be used for good or bad reasons but, unless you have a specific reason to allow them, you should probably block them as they serve little positive use for you. Those will, by and large, obey robots.txt.

Also, many bad bots will obey robots.txt simply because so few sites bother to block them and there is little reason for them not to. Not following the standard can open them up to potential legal challenges, including copyright claims and trespass to chattels (though such a theory presents its own dangers and challenges), so there is great risk in ignoring the rules and little to gain by doing so.

There are still other ways around this. Some nefarious bots will change their name, or even randomize it, to avoid being targeted by robots.txt. In doing so they gain the rights granted to all other bots not specifically listed, which are usually more liberal. Most bots that don’t want to obey robots.txt, however, will simply not do so, and unless the server takes some additional step to block them, such as an IP or user-agent block (both of which can also be circumvented), there isn’t much to stop them.
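Both halves of that problem can be shown in a short, hedged sketch. The first part uses Python's urllib.robotparser to show a renamed bot slipping into the permissive catch-all group. The second is a bare-bones version of a server-side check that does not depend on the bot's cooperation. The function should_block, the blocklists and the bot names are hypothetical, and the IP addresses come from ranges reserved for documentation.

```python
import ipaddress
from urllib.robotparser import RobotFileParser

# Hypothetical rules: bar one named bot, admit everyone else.
ROBOTS_TXT = """\
User-agent: ExampleScraper
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Under its listed name the bot is shut out; under a made-up name it
# inherits the liberal rules meant for every unlisted bot.
print(parser.can_fetch("ExampleScraper", "/archive/"))   # False
print(parser.can_fetch("xq7-random-name", "/archive/"))  # True

# Hypothetical server-side blocklists (203.0.113.0/24 is a documentation range).
BLOCKED_AGENT_SUBSTRINGS = ["examplescraper"]
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def should_block(ip, user_agent):
    """Decide on the server whether to refuse a request outright."""
    agent = user_agent.lower()
    if any(fragment in agent for fragment in BLOCKED_AGENT_SUBSTRINGS):
        return True
    address = ipaddress.ip_address(ip)
    return any(address in network for network in BLOCKED_NETWORKS)


print(should_block("198.51.100.7", "ExampleScraper/2.0"))  # True: user-agent match
print(should_block("203.0.113.50", "xq7-random-name"))     # True: IP match
print(should_block("198.51.100.7", "xq7-random-name"))     # False: new name, new IP
```

The last line is the catch: a bot that changes both its name and its address gets past this kind of check as well, which is why neither block is a complete answer.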

One other possibility is to use cloaking to trick scrapers into grabbing the wrong content, but that approach carries its own risks.

Bottom Line

Given that there is very little risk in editing your robots.txt file to block bad bots, there is very little reason not to do it. It may not block all or even most of them, but it can block some, and that makes it of some small benefit.

The problem is that, if more webmasters start blocking bad bots via robots.txt, the method will become less effective as more and more bot operators decide the risk of ignoring it is worthwhile. However, that would require a very large number of webmasters to engage in the practice, which is unlikely considering that many sites don’t even have control over their robots.txt file.

In short, if you can do it, you probably should. It’s an extra layer of protection against those who might wish to misuse your site or your content. It is not much, but certainly every little bit helps.