5 Reasons Plagiarism Detection is So Difficult

If you’re looking for a plagiarism detection service, there are many to choose from. Well over three dozen companies and services are available right now to take a crack at helping you either determine whether a work is original or find plagiarized copies of a work elsewhere online.

However, many, if not most, of them are complete garbage.

So while the number of services that help people detect copied text is growing, most of them have been of very low quality and of very limited usefulness. This is something that’s highlighted in Dr. Deborah Weber-Wulff’s regular tests of plagiarism detection systems (detailed writeup coming).

But why is this? With so many companies entering the market and engineers addressing the problems, new plagiarism detection services should be getting exponentially better, not lagging well behind established players.

The reason for this is simple: Plagiarism detection is very difficult and, sadly, technology isn’t really able to address the challenges that the industry faces.

So if you’re looking to create a new plagiarism detection service, here’s what you’re up against and the problems you are going to face. If you’re the user of such a service, here are the reasons why, at least in many cases, your results may be less than satisfactory.

1. Database Size

A plagiarism detection service is only as useful as the amount of content it is able to search through. After all, if the matching copy isn’t in the database, even the best tool can’t find it.

Search APIs provided by Google and others have given developers the ability to search the entirety of the visible Web. However, even online, these searches miss a great deal. For example, it’s estimated that Google indexes only about 10% of Web-based content, with the rest hidden behind the “invisible Web” of temporary pages, content locked behind logins and so on.

In addition to that, there’s also a great deal of content that’s not available on the Web. Books, journals, old news articles, etc. may not be indexed online. This can be especially troublesome for academic or research-oriented plagiarism.

Only one company, iParadigms, has invested heavily in securing partnerships with publishers of academic and research materials (Disclosure: I am a paid consultant for iThenticate, a subsidiary of iParadigms.). However, those partnerships came at great expense and took years to set up.

In short, most new plagiarism detection services are going to have a limited database size and there’s not much that can be done about it.

2. Finding Matches

If you have a large enough database, you still have to find matching content within it. The science of search is well-studied but, with plagiarism detection, it’s much more complicated than usual.

Typically, this matching is done in two phases. The first is to look for works that are suspiciously similar. This can be done by either using test strings from the submitted document or fingerprinting it digitally to find works that may be similar. Those works are then evaluated more closely and the actual amount of matching is determined.
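As a rough illustration of that first, candidate-finding phase, here is a minimal sketch using word n-gram “shingles” and Jaccard similarity. The function names, the shingle length and the 0.2 threshold are my own illustrative assumptions, not any vendor’s actual method; real services use far more sophisticated fingerprinting at scale.

```python
# Hypothetical sketch of the candidate-retrieval phase described above:
# find documents whose word n-gram ("shingle") overlap with the suspect
# document is suspiciously high. All names and thresholds are illustrative.

def shingles(text, n=5):
    """Return the set of word n-grams (shingles) in a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Overlap between two shingle sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def find_candidates(suspect, corpus, threshold=0.2):
    """Phase one: keep only documents similar enough to warrant closer evaluation."""
    s = shingles(suspect)
    return [doc for doc in corpus if jaccard(s, shingles(doc)) >= threshold]
```

The surviving candidates would then go through the second, closer evaluation phase. Even in this toy version, the core tension is visible: raise the threshold and short copied passages slip through; lower it and unrelated documents flood the results.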

The challenge of this system is finding a way to make it accurate enough to detect smaller matches that are still plagiarism without generating too many false positives. This is a tough balance and can be difficult even when just evaluating two documents, let alone one against billions.

Most plagiarism services are prone to either false positives or false negatives. While some of these errors are almost impossible to avoid, many systems are so hamstrung by them as to be effectively useless.

3. Displaying the Results

Once the matches are found, the software then has the difficult task of displaying the information it’s gathered. Most complicated of all, it has to take a relative mountain of complicated data and explain it concisely to a user who may not be adept at interpreting such reports.

Trying to inform the user without overwhelming them is a difficult game and services either risk oversimplifying the data, giving the impression that no human analysis is needed, or presenting too much data and making such an analysis almost impossible.

Getting it just right can be difficult and it’s made more complex by the fact that there are two types of plagiarism detection tools: one designed to test the originality of a suspect work and another designed to find matches against a known original.

Unfortunately, many services punt on the interface issue and either simply provide a collection of links or try to boil down the analysis to a simple percentage with little additional information. Either way, it does the user a disservice.

4. Business Model

If you’re able to build a great plagiarism checker, building a good business model might be even more elusive.

While the need and demand for plagiarism detection tools is at an all-time high, as with most tech services, most people are looking for free tools that they can use. Whether it’s for a quick check or a longer-term need, budgets for plagiarism detection tend to be scarce and, outside of a few markets, there’s a lot of hesitation to pay.

While many plagiarism detection services have managed to stay afloat, most remain small operations or even part-time ventures. Plagiarism detection doesn’t have the mass market appeal to be a consumer product and there is limited room for those who want to target academic or enterprise customers.

In short, keeping the doors open can be a very serious struggle.

5. Other Types of Plagiarism

But even if you do all of this correctly and build the best, most profitable plagiarism detection service known to man, you are still going to be dealing with some very serious limitations.

First off, there are many types of plagiarism that can’t be detected through text matching. Those include paraphrased plagiarism, plagiarism of ideas and, in many cases, translated plagiarism (Note: Turnitin, another subsidiary of iParadigms, launched translated plagiarism detection last year).

However, perhaps the biggest problem is that only a human can determine whether or not a section of duplicate text is plagiarized. The goal of the software is not to “detect” plagiarism, but rather, to help in the analysis.

While smart plagiarism detection tools understand that and work to meet that need, convincing the user of this can be much more difficult, especially when many are seeking a magic box to find all of the plagiarism for them, without any work on their behalf.

Bottom Line

While this post might seem very down on plagiarism detection tools, it isn’t meant to be. These challenges are very tough and it’s no surprise that most services that attempt to enter this field don’t perform as well as established players.

Knowing and understanding these challenges makes me greatly respect companies like iParadigms, Copyscape and PlagScan that manage to do a good job.

However, this also explains why I am so skeptical of new companies that come along and claim that they can take a place in the field immediately.

But when it’s all said and done, despite their limitations, plagiarism detection tools provide a valuable service, making the analysis of works for plagiarism much faster, more accurate and easier than before. While they aren’t perfect, we are definitely better off with them than without them and, when used correctly, they are worth every penny they cost.