AI Search Fails at Citation

One of the recent trends in generative artificial intelligence has been AI search. Nearly every major AI company has launched some version of an AI search engine, often with lofty promises about what it can do.

For example, OpenAI says its ChatGPT Search will give you “fast, timely answers with links to relevant web sources, which you would have previously needed to go to a search engine for.”

Likewise, Microsoft says that its Copilot search will “even provide you with the websites where it got its information, so you can do further research on your own if you’d like.”

But how well do they perform at citation? Are they as reliable and robust as they claim? According to a new Columbia Journalism Review (CJR) study, the answer is a resounding no.

The CJR assigned eight separate “AI Search” tools the relatively straightforward task of finding the source of a quoted passage. However, in more than half of the tests, the AI systems failed to answer the question correctly.

In many cases, the AI search tools confidently gave incorrect answers. This indicates that AI systems can’t determine when their answers might be problematic and instead present false information as accurate.

It’s a damning look at how AI systems handle citation, and a problem for anyone who uses these systems.

The Basics of the Study

The CJR chose 20 publishers and selected 10 articles from each, for a total of 200 articles. Passages from those articles were then fed to eight separate AI search tools.

The bots were then told to find four pieces of information:

  1. The corresponding article’s headline
  2. The original publisher
  3. The publication date
  4. The URL for the original article

The researchers then graded the results of the 1,600 total queries. If all four elements were correct, the result was “Completely Correct.” If the information was accurate but some elements were missing, the result was labeled “Correct But Incomplete.”

Finally, the results were either “Partially Incorrect” or “Completely Incorrect” if some or all answers were wrong. If the bot declined to answer, the researchers labeled it as “No Answer Provided.”
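
The study does not publish its grading code, but the rubric maps naturally onto a small function. The Python sketch below is purely illustrative: the field names, the dictionary format, and the exact-string comparison are my assumptions, and the study’s graders presumably judged matches far more loosely.

```python
def grade(answer: dict, truth: dict) -> str:
    """Return one of the study's five labels for a single query."""
    fields = ["headline", "publisher", "date", "url"]

    # Fields the bot actually attempted to answer.
    provided = [f for f in fields if answer.get(f)]
    if not provided:
        return "No Answer Provided"

    # Exact-string matching is a simplification; human graders
    # presumably accepted close variants.
    correct = [f for f in provided if answer[f] == truth[f]]

    if len(correct) == len(fields):
        return "Completely Correct"
    if len(correct) == len(provided):
        # Everything the bot gave was right, but some fields were missing.
        return "Correct But Incomplete"
    return "Partially Incorrect" if correct else "Completely Incorrect"


# Example: right headline and publisher, missing date and URL.
answer = {"headline": "Example Headline", "publisher": "Example News"}
truth = {"headline": "Example Headline", "publisher": "Example News",
         "date": "2024-01-01", "url": "https://example.com/story"}
print(grade(answer, truth))  # -> Correct But Incomplete
```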

In total, the AI search tools answered more than 60% of the prompts either completely or partially incorrectly. Performance was far from uniform, however: Perplexity answered “only” 37% of its queries incorrectly, while Grok had an error rate of 94%.

However, as bad as these results are, the study found something that may be even more concerning.

Ignoring Publishers’ Requests

Five of the eight chatbots that the CJR examined have made their crawler information public. This means that, in theory, publishers can block those bots using robots.txt files and HTTP headers such as X-Robots-Tag.
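
For illustration, a block of this kind is just a few lines of plain text served at a site’s root. The user-agent strings below are the publicly documented ones for OpenAI, Perplexity, and Google’s AI products at the time of writing; vendors change these, so treat the list as an example rather than a definitive reference.

```
# robots.txt -- ask specific AI crawlers to stay out of the entire site
User-agent: GPTBot            # OpenAI's crawler
Disallow: /

User-agent: PerplexityBot     # Perplexity's crawler
Disallow: /

User-agent: Google-Extended   # opt-out token for Google's AI products
Disallow: /
```

Crucially, robots.txt is purely advisory: it asks crawlers to stay out, but compliance is entirely voluntary on the crawler’s side.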

However, the study found strong evidence that some bots ignore those requests.

The worst culprit was Perplexity. Both Perplexity and Perplexity Pro were tested against nearly 90 articles that should have been off-limits to them, yet both answered 26 of those queries entirely accurately, including all 10 that involved National Geographic, which has supposedly blocked Perplexity’s crawlers.

A significant number of publishers also blocked ChatGPT and Gemini. Of the 70 articles whose publishers blocked it, however, ChatGPT fully answered only three queries and partially answered another four; on the other 63, it gave wholly or partially incorrect information.

Gemini, for its part, answered only 2 of its 100 blocked sources partially correctly; 73 answers were partially or entirely incorrect, and it declined to answer the other 25.

Despite having publicly known crawlers, Copilot was not blocked from any articles, likely because it shares crawlers with the Bing search engine.

Still, it is clear that Perplexity is ignoring robots.txt and other blocking measures. Meanwhile, it and other AI search engines continue to give incorrect responses even about content they should never have had access to.

What This Means

If you’re an AI search user, there isn’t much positive here. Even the best performer, Perplexity, was still wrong nearly 40% of the time. That is far too high an error rate to rely on.

To be clear, this shouldn’t have been a very challenging test. The prompts were direct quotes from the articles, not paraphrases or loose summaries.

For a traditional search engine, this would have been straightforward. While it would have required some human effort, finding all the correct information wouldn’t have taken long.

However, the selling point of AI search tools is that they eliminate or reduce this human effort. But how do they reduce that effort if one must always double-check the results?

The very usefulness of AI search relies upon its accuracy. This study makes it clear that the advertised accuracy is an illusion.

While the tools may improve, AI search faces the same numbers game as AI detectors: even at 99% accuracy, someone checking 100 citations should still expect one error, with no way of knowing which one it is. Humans will still need to validate the results, and that validation defeats much of the benefit.

This impacts journalism, academia, and research, three spaces where citation accuracy is critical. Right now, the use of AI search engines should be heavily discouraged in these spaces.

In short, if you value accuracy, you need to know that you can trust the tools you rely on.

Bottom Line

While some will likely argue that this was an unfair test of AI search engines, it should have been low-hanging fruit. Finding direct quotes is something non-AI search engines do every day.

However, the issue isn’t merely that the bots couldn’t find the correct answers; it’s that they often presented wrong ones. Had the chatbots simply declined to answer, one could conclude that they aren’t suited for this task and leave it at that.

But presenting false information, often with seeming confidence, is far more damning than providing no information. Even worse, some of the correct information was supposed to be off-limits.

This paints a picture of AI search tools that ignore the wishes of human creators and still manage to get so much information wrong. It’s a sad image.

In the end, even if these tools improve, they will struggle to reach a level at which humans can blindly rely upon them. The same numbers problem that holds back AI detectors holds back AI search itself.

While generative AI and AI search are both amazing pieces of technology, they have not reached that level of accuracy. If this study is anything to go by, they have a very long way to go.
