Why AI Has a Plagiarism Problem

Last week, Randall Lane, Forbes’ chief content officer, published an article targeting the AI company Perplexity, accusing it of plagiarizing Forbes’ original reporting.

Perplexity describes itself as an “AI-powered answer engine” that provides answers to user questions. However, according to Lane, the “answer engine” fails to provide clear attribution or to avoid reusing original text.

Lane highlights work by Sarah Emerson and Rich Nieva, two Forbes reporters. Emerson and Nieva have been working on a long-term series about a Google drone project. Lane claims that Perplexity copied some of their language in an answer on the topic and provided only minimal attribution.

To make matters worse, Perplexity also created a podcast and a video based on its answer, neither of which gave attribution to Forbes. Lane also noted that, even when Perplexity did give attribution, it credited secondary sources just as prominently as Forbes’ original reporting.

When asked about the issue, Perplexity CEO Aravind Srinivas said the product had some “rough edges.” Perplexity neither removed nor corrected the story.

Wired Magazine added fuel to the fire with its own exposé on Perplexity, accusing the company of ignoring publishers’ wishes and indexing sites that should be off-limits to it. Ironically, the magazine also found that Perplexity doesn’t always visit the page it’s supposed to summarize and, instead, invents a story.

So why does AI have such a severe problem with plagiarism? It’s because plagiarism is hard.

AI Plagiarism: Two Issues, Same Problem

When looking at plagiarism issues in generative AI systems, there are two problems with a common root cause.

The first problem is that AI repeats content verbatim or near-verbatim, even when it is supposed to be wholly original. According to a February report by Copyleaks, nearly 46% of the GPT-3.5 outputs tested contained identical text. Many others contained text that was lightly edited or poorly paraphrased.

Though GPT-4 does seem to have reduced those issues, they remain prevalent and are one way creators discover that AI systems have trained on their work.

The second problem is attribution. Many AI systems don’t provide attribution at all, and those that do often do so inadequately or cite the wrong sources. This was a significant issue for Forbes, which argued that Perplexity gave secondary sources equal weight to its original reporting.

But why are AI systems struggling with this? It would appear that any system capable of writing “original” content could avoid verbatim plagiarism and handle attribution.

However, as AI companies are learning, plagiarism and attribution are complicated, nuanced topics. They are also topics that algorithms have long struggled with.

Algorithms, Plagiarism and Citation

For all the hype around AI, these systems are, at their core, advanced algorithms that attempt to guess the next word. Despite being dubbed “artificial intelligence,” they don’t understand the material they’re reading or writing.
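To make that concrete, here is a minimal, hypothetical sketch of what “guessing the next word” looks like. The function name, vocabulary and probabilities below are invented for illustration; a real model scores tens of thousands of tokens with a neural network, but the key point is the same: it picks likely continuations, and nothing in that process carries a notion of source or attribution.

```python
# Minimal, illustrative sketch of next-word prediction (all values invented).
# Nothing in this process knows where the training text came from, so nothing
# in it can decide whether the output needs quotation marks or a citation.

def predict_next_word(context: str) -> str:
    # Hypothetical probabilities a model might assign after seeing the context.
    candidate_scores = {
        "Baker": 0.62,   # continuing a phrase seen many times in training data
        "Fleet": 0.21,
        "Oxford": 0.17,
    }
    # Greedy decoding: pick the single most probable continuation.
    return max(candidate_scores, key=candidate_scores.get)

print(predict_next_word("Sherlock Holmes lived at 221B"))  # -> "Baker"
```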

Though AI’s capabilities are impressive, it’s important to remember its limitations. AI struggles in spaces where the rules are unclear, including plagiarism and attribution.

For example, writing a rule to bar the “regurgitation” of any text would not work. For starters, there are many scenarios where it’s acceptable to have copied text, even without quotation marks.

If I wrote the sentence “Sherlock Holmes lived at 221B Baker Street,” no one would expect me to put it in quotes (other than for clarity here), even though, according to Google, that sentence appears on hundreds of pages.

However, if I wrote, “Referring to himself as a ‘consulting detective’ in his stories, Holmes is known for his proficiency with observation, deduction, forensic science and logical reasoning,” I should probably cite the Wikipedia entry from which I copied it.

Sometimes, “regurgitating” is fine, even necessary. Other times, it’s not. The difference is nuanced and depends on the amount copied, how common the text is, how it’s being used and what type of work it is being put in.

In short, there’s no bright-line rule, and even if there were, it would differ for each type of writing. An AI that understood citation for journalists would likely struggle when tasked with writing a research paper.
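To illustrate the point, consider a naive, hypothetical check that flags any sentence found verbatim in a reference corpus (the corpus and helper below are invented for illustration). It flags the common factual sentence and the distinctive copied prose identically, even though only the latter calls for a citation.

```python
# Naive "no regurgitation" rule: flag any sentence that appears verbatim in a
# reference corpus. The corpus here is invented purely for illustration.
REFERENCE_CORPUS = {
    "sherlock holmes lived at 221b baker street.",  # common factual phrasing
    "referring to himself as a 'consulting detective' in his stories, "
    "holmes is known for his proficiency with observation, deduction, "
    "forensic science and logical reasoning.",       # distinctive copied prose
}

def naive_regurgitation_check(sentence: str) -> bool:
    """Return True if the sentence matches the corpus verbatim."""
    return sentence.strip().lower() in REFERENCE_CORPUS

sentences = [
    "Sherlock Holmes lived at 221B Baker Street.",
    "Referring to himself as a 'consulting detective' in his stories, "
    "Holmes is known for his proficiency with observation, deduction, "
    "forensic science and logical reasoning.",
]
for s in sentences:
    # Prints True for both: the rule cannot tell acceptable reuse from plagiarism.
    print(naive_regurgitation_check(s))
```

A rule like this would have to be softened with exceptions for common phrasing, the amount copied and the type of work, which is exactly the kind of case-by-case judgment that next-word prediction doesn’t provide.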

Then there’s the issue of sources. Some AI systems can’t even recall what sources they used, and even those that can still struggle to distinguish between authoritative and secondary sources.

Google has wrestled with this for decades. Though the search engine has long tried to boost authoritative sources, it has repeatedly failed, at times ranking works of plagiarism above the original source.

This problem also explains why AIs sometimes get things wrong. For example, Google’s AI recently recommended putting glue on pizza and eating one rock per day, in part because it trusted parody sites as authoritative.

Humans regularly make these mistakes, and any AI system will be worse than humans. It’s just that simple.

Bottom Line

The Ninety-Ninety rule reads as follows:

The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time.

Tom Cargill, Bell Labs

Though it was meant as a humorous take on the way so many coding projects languish at the 90% mark, it also explains some of the problems here.

In many ways, AI is at that 90% mark. What it can do is genuinely impressive. However, before humans can trust it, AI must understand citation, authority, plagiarism and more. These are complicated and nuanced topics. Worse still, the rules are constantly shifting and changing. These are moving targets.

Will AI figure these issues out? I’m skeptical. As we’ve seen repeatedly, technology can often make grand leaps and seemingly get very close to solving a major problem. Then, inches from the goal line, it struggles with edge cases and nuances it must conquer before reaching actual usability.

For an example, look no further than Tesla’s “self-driving” cars. Though what they can do is impressive, they are nowhere near truly self-driving, which has made the company the target of a lawsuit over its claims.

AI is in a similar place. It might be able to do some impressive things, but it can also drive you into a train.

Fun Side Note

As part of my research for this article, I experimented briefly with Perplexity. I found its answer to one question fairly telling.