How to Block ChatGPT (And Why to Do It)
Earlier this week, OpenAI, the company behind the generative artificial intelligence (AI) system ChatGPT, announced its latest web crawler, GPTBot, and included steps that webmasters can take to prevent it from indexing websites.
The decision represents a major shift for OpenAI and ChatGPT. Previously, the company used a variety of systems to index content. This meant that there was no simple way to block ChatGPT specifically, and webmasters were pushed to block all bots, including more desirable crawlers such as Google's and Bing's, to prevent indexing by ChatGPT.
Now, webmasters can block ChatGPT directly, without blocking other bots. The company has said that its bot will follow the robots.txt standard and, in doing so, will respect sites that make it clear they don’t want their work indexed for inclusion in ChatGPT.
The move comes less than a month after comedian Sarah Silverman and two other authors filed a class action lawsuit against OpenAI, alleging that their work was used to train ChatGPT both without their permission and in violation of copyright law.
With that in mind, this shift comes as too little, too late for many creators, who note that the system is not retroactive and has significant limitations in terms of who can actually use it.
Still, this announcement raises a major question for webmasters: Should you block ChatGPT on your site and, if so, how do you go about doing it?
The Basics of Robots.txt
A web crawler (sometimes referred to as a spider) is basically an application that scours the web, or crawls it, for the purpose of capturing the content that it finds.
The most common example is search engines, such as Google and Bing, which use web crawlers to index websites and provide search results. Other examples include using a spider to create an archive of the internet or to detect plagiarism in newly created work.
However, web crawlers can also be used for nefarious purposes. Crawlers have been, and still are, used to collect personal information, grab content for unlawful republication and so forth. As such, it’s often necessary to block web crawlers from accessing some or all of a site.
Though there are many tools for doing this, one of the most important is the robots.txt file.
Robots.txt is a file that’s placed on a server to provide instructions for crawlers visiting the site. The file doesn’t place any physical or technological barriers against crawling, but provides instructions that legitimate crawlers typically agree to honor.
For example, you can use robots.txt to allow all web crawlers to access your site save one or more crawlers that you wish to block, or allow access to most of your site while restricting access to certain folders or files. Whether it’s blocking unwanted crawlers, preventing personal information from being indexed or just ensuring content isn’t duplicated, it’s a powerful tool for preventing unwanted crawling.
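For instance, a hypothetical robots.txt that lets every crawler in but keeps all of them out of a single folder (the /private/ directory here is purely an illustration) would look something like this:

User-agent: *
Disallow: /private/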
That said, there’s nothing stopping unethical crawlers from ignoring robots.txt. Many do, which is why there’s a whole industry of services dedicated to blocking “bad bots” from accessing your site through other means.
What has changed is that OpenAI has both said it will honor robots.txt and, at the same time, provided a way to target its crawler specifically. This way, you can block GPTBot without having to also block all other robots.
That, in turn, is the decision webmasters have to make.
How to Block ChatGPT (and Why)
The first step in the process is accessing and editing your site’s robots.txt file. How easy that is varies heavily depending on how your website is built.
WordPress users can use a plugin such as Yoast SEO to quickly and easily edit their robots.txt file. Wix users can edit their robots.txt file by following these instructions, Weebly has its instructions here, and so forth.
Once you have access to the Robots.txt file, all you have to do is copy and paste the code provided by OpenAI to block the web crawler.
User-agent: GPTBot
Disallow: /
Where the block is placed in the file isn’t important. According to the robots.txt standard, a crawler follows the group of rules whose User-agent line most specifically matches it, so GPTBot will obey its own group rather than a general one. For example, this would allow all crawlers except GPTBot:
User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /
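You can also take a middle path and only keep GPTBot out of part of your site. A hypothetical rule set (the folder names here are purely illustrative) might allow it into one directory while blocking another:

User-agent: GPTBot
Allow: /public-writing/
Disallow: /private-archive/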
Once you add the code and save the file, it should take effect immediately, meaning the next time GPTBot visits your site, it should see the rule and stay out.
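If you want to verify that the rule reads the way you expect, one quick option is a short script using Python’s built-in urllib.robotparser module. The example.com domain below is just a placeholder for your own site:

from urllib.robotparser import RobotFileParser

# Point the parser at your site's robots.txt (example.com is a placeholder)
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Prints False if GPTBot is disallowed from crawling the homepage
print(rp.can_fetch("GPTBot", "https://example.com/"))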
The reasons to do this are plentiful.
From a purely practical standpoint, crawlers like GPTBot do use server resources. Though the amount is typically small and barely noticeable, they can still be an issue for those on hosting plans that have limited resources.
However, the much bigger concern is the use of human-created content to train ChatGPT. Generative AI simply is not possible without mountains of human-created content to train it on and, for both ethical and legal reasons, many don’t want their work used that way.
For those who feel that way, this provides a relatively simple way to make your preference known. That said, there are still some pretty significant limitations, and reasons why even the most enthusiastic anti-AI creators may want to pause before making this move.
The Limitations of This Approach
Robots.txt is a robust standard that has been around for nearly 30 years. However, it’s far from a complete solution to the issue of unwanted bots.
The biggest limitation is that using it requires a degree of access to the server that many creators simply don’t have.
Robots.txt is fine if you use WordPress, Weebly, Wix or any of the other services that allow users to edit their Robots.txt file. However, many services, like Squarespace, don’t allow users to edit the file, meaning that there is no way for them to block ChatGPT.
Then there are situations like Medium, Tumblr and Substack where creators don’t control the server at all and can’t set the rules. They have to rely on the company itself to do it for them (and all other users). The same is also true for those who primarily post on social media sites such as Facebook, Instagram, X (Twitter) and so forth.
But, even if you are in a position to block ChatGPT, you run into the biggest limitation of all: The move isn’t retroactive.
It’s a virtual guarantee that your site has already been crawled by OpenAI or one of its partners. Blocking it today does nothing to remove all that content from the system.
For this site, that represents nearly 18 years of history and some 5,600 posts. Blocking new works is, to use the cliché, like closing the barn door after the horses have escaped.
Finally, it doesn’t do anything to block other AI systems. ChatGPT may be the most famous AI system, but it’s far from alone. Google has Bard, Meta has Llama 2 and so forth. None of those systems are blocked by this approach.
As such, there are plenty of valid reasons to either not bother with this change or to actively decide against it. This is especially true if you don’t want to validate OpenAI’s behavior, namely its lip service to creators, by adopting its new crawling standard.
Bottom Line
In the end, if you are a webmaster who can edit your robots.txt file and you don’t want your content being used by AI systems, there’s little reason not to do this. Yes, it does nothing to address previous non-consensual crawling and doesn’t stop other AI systems, but it is a small improvement.
That said, the real shame here is on both OpenAI and other AI systems. This is an acknowledgement that there are reasons webmasters and creators may not want their content used to train an AI system. However, OpenAI only implemented this feature now, not when it first began crawling the web for this purpose.
Furthermore, the fact that it is not retroactive shows that the company is perfectly happy to use content it had no clear permission to access; it just wants the public relations credit for giving webmasters control moving forward. Realistically, though, that control was lost months or years ago, when this project began scraping content.
For those who say that this is too little, too late, I completely agree.
OpenAI could have done this much earlier, offered a way for creators to clearly opt out, or simply made this change retroactive. It did none of those things because they would have hurt the quality of ChatGPT and required spending resources.
Still, if you are a webmaster and don’t want your content used to train AI systems, this is something you can and probably should do.
It isn’t much, and it’s far less than what OpenAI could have and should have offered, but it’s still something.