ChatGPT Ignores Robots.txt, Rehashes My Column


As many of you may know, I publish a 3 Count column every Monday through Thursday, highlighting three important copyright-related stories. I began this column in 2009, and according to my WordPress, it includes nearly 3,000 columns to date.

One thing I rarely discuss is how, at times, it’s difficult to bring the column together. Most nights, I begin looking for stories before heading to bed, as writing the 3 Count is one of the first things I do in the morning. This has always been challenging during slow news periods, but the degradation of Google News and the increased use of paywalls have made it more difficult than ever.

Last night was one of those difficult nights.

So, I decided to give Google Gemini a shot. I, like many Google Workspace users, was forced into paying for Gemini. So, I gave it the prompt, “What is the latest copyright news from the past 24 hours?”

The results were disappointing. It listed three stories, all of which were much older than one day. One was literally from three weeks ago.

I decided to see if ChatGPT could do any better. I gave the free version the same prompt and received a very different response. This one was even worse than Gemini’s. Though the information was more accurate, it was also mine. ChatGPT simply rewrote and presented my previous 3 Count column back to me.

ChatGPT's Response

While this is a clear embarrassment for ChatGPT, there’s an even bigger issue: ChatGPT wasn’t supposed to be on my site at all.

An Ineffective Block, A Worse Answer

In August 2023, I published an article titled How to Block ChatGPT (And Why to Do It). There, I provided simple instructions, based on OpenAI’s own directions, on how to prevent ChatGPT, or rather, GPTBot, from indexing your site.

I added the code to this site’s robots.txt at the same time. Because I hadn’t bothered to expand the block beyond GPTBot, it’s currently the only bot expressly disallowed from this site. That was nearly two years ago, and OpenAI’s own documents describe “disallowing GPTbot to indicate that crawled content should not be used for training OpenAI’s generative AI foundation models.”

I used the robotstxt.com checker to test whether the page was blocked to ChatGPT, and it confirmed that it was.

robots.txt check showing ChatGPT is supposed to be blocked
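For readers who want to repeat this kind of check locally, here is a minimal sketch using Python's standard `urllib.robotparser`. It assumes the robots.txt contains only the two-line GPTBot rule (the real file on this site may contain more), and the `is_allowed` helper is a name made up for illustration. The second user agent, `ChatGPT-User`, does not come from this article; it is included on the assumption that OpenAI lists it as a separate agent, so verify it against OpenAI's current crawler documentation.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the file holds only the GPTBot block. The live robots.txt
# on any given site may contain additional groups.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://www.plagiarismtoday.com/"
    # "ChatGPT-User" is an assumed agent name, not taken from the article.
    for agent in ("GPTBot", "ChatGPT-User"):
        status = "allowed" if is_allowed(ROBOTS_TXT, agent, url) else "blocked"
        print(f"{agent}: {status}")
```

Under these assumptions, the parser reports GPTBot as blocked and any agent the file does not name as allowed. That only describes what the rule says. Whether a crawler actually honors it is a separate question that robots.txt itself cannot answer.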

Yet, it’s abundantly clear that ChatGPT is continuing to train on this site’s content.

I would blame it on a pirate site or an unauthorized copy, but ChatGPT’s attribution cites plagiarismtoday.com for each item.

It also provides a link to the story at the bottom of the response.

ChatGPT providing attribution

Worse still, though ChatGPT did not copy my words, it presented the same three stories in the same order. In one case, I was the sole source cited, while in another, I served as the lead source. Finally, in the third, I was listed along with the Associated Press.

However, ChatGPT never cited the sources I used (and credited) in the column. Deadline, TorrentFreak and Seeking Alpha were not mentioned.

So, in addition to training on Plagiarism Today when it isn’t supposed to, ChatGPT relies on it and fails to direct users to the actual sources used.

This is a disservice to both me, as a creator, and the users of ChatGPT.

Weak Attribution, Wrong Answers

Attribution and AI are a hot topic. Earlier this month, I wrote about how, thanks to AI searches, most websites are seeing sharp decreases in traffic. This was driven home on July 14th in an article by Jason Koebler at 404 Media, where he highlighted that ChatGPT had only sent one paying subscriber to their site.

With that in mind, I’m surprised at how thoroughly ChatGPT credited Plagiarism Today. The result literally had four links back to the column.

However, there was nothing in the AI breakdown that couldn’t have been conveyed just as well through a link. Though it added a “Why it Matters” section, two out of the three points were either wrong or misleading.

To be clear, the second and third points are wrong. MagisTV abandoned the trademark, prompting the USPTO to reject it. It had nothing to do with “skepticism of IP violations and cyber risks.”

The third is also misleading, as that story was about Rimini Street’s stock rising after settling a long-running and potentially devastating lawsuit with Oracle. It has nothing to do with infrastructure.

In short, the only thing ChatGPT did, other than rewrite my column, was provide inaccurate information about the stories presented within it.

All of it from a site that it wasn’t supposed to be on in the first place.

Bottom Line

To be clear, I recognize that there are many limitations to this. First, this is just one prompt gone awry. Many can and will argue that it’s a fluke. Some may say that I made a mistake, even though many others, including the New York Times, use the same code. There’s also the experimental robots exclusion protocol, “DisallowAITraining,” that many say I should use.

I also recognize that this query could be seen as unfair to AI. AI systems, especially free models, famously struggle to stay up to date. I asked it for news from the past day. Admittedly, it’s a surprise that free ChatGPT had anything to offer.

However, look at this from my perspective as an AI skeptic. I attempted to use AI to expedite a process that has become increasingly difficult and tedious. The result was that it rewrote the content of a single URL, provided false information and still didn’t direct me to the original sites. The fact that it seemingly rewrote my work is just an extra twist.

It didn’t save me any time or work. It could have misled me about some of the stories, and though it directed me to its source, it didn’t go the extra step of directing me to the actual sites the information came from.

While I’m proud of the 3 Count and use it as a model for ethical aggregation, that ethical aggregation is meaningless unless users are directed to the source. I try to do my part through large headlines and clear attribution, but ChatGPT did not.

Everyone would have been better served if ChatGPT had just provided a link to the column.

To make matters worse, as a webmaster and a writer, I wonder why ChatGPT was on my site in the first place. The robots.txt exclusion has been in place for nearly two years. Yet, it clearly accessed the site yesterday and regurgitated my then-latest 3 Count column.

My plan right now is to wait a few days and then update my robots.txt with information from robotstxt.com. I don’t necessarily believe that it will prevent any AI system from indexing the site, but at least I can say that I tried.
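As a rough sketch of what a broader block could look like, the Python below builds one full-site Disallow group per user agent and then confirms, with the standard `urllib.robotparser`, which agents the generated rules cover. The agent list is purely illustrative and is not taken from robotstxt.com or from this site's actual file. The helper names `build_block_rules` and `blocked_agents` are hypothetical, and every agent name should be checked against its operator's documentation before use.

```python
from urllib.robotparser import RobotFileParser

# Illustrative assumption: these names are examples only and must be
# verified against each operator's published documentation.
AI_USER_AGENTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "CCBot", "Google-Extended"]


def build_block_rules(user_agents):
    """Return robots.txt text with one full-site Disallow group per agent."""
    groups = [f"User-agent: {agent}\nDisallow: /\n" for agent in user_agents]
    # A blank line between groups keeps each agent in its own record.
    return "\n".join(groups)


def blocked_agents(robots_txt, user_agents, url="/"):
    """Return the agents from user_agents that robots_txt disallows for url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in user_agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    rules = build_block_rules(AI_USER_AGENTS)
    print(rules)
    print("Blocked:", blocked_agents(rules, AI_USER_AGENTS + ["Googlebot"]))
```

Generating the groups from a list keeps the file easy to extend as new crawlers appear. As with the existing GPTBot rule, this only states a preference, and it depends on each crawler choosing to honor it.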
