The Problem with Detecting Translated Plagiarism

Jonathan Bailey, February 24, 2011

Dr. Deborah Weber-Wulff is possibly best known for her recurring and stringent testing of plagiarism detection systems. Every year or two, she publishes a thorough report of her findings, as she did in January, that serves as probably the best barometer of such systems.

But while the “best” systems change out with almost every test, there is an overall trend of at least slow progress in the systems, especially in detecting smaller bits of plagiarism.

However, there is one area where there has been almost no progress: translated plagiarism. Even in the 2011 test results, no system was able to effectively deal with the issue of translated plagiarism and, for the foreseeable future, that is likely to remain the case.

The reason is that the way automated systems detect plagiarism isn’t very well-geared toward detecting translated plagiarism and it’s unlikely any automated system, at least not for some time, will be very effective at it.

But this doesn’t mean that translated plagiarism will become rampant or even a safe haven for plagiarists. It just means that other detection methods will need to be used.

How “Plagiarism” Detection Works

To be completely clear, plagiarism detection systems don’t actually detect plagiarism at all. They detect copies. The exact methods vary but the results are the same: the systems look for matching phrases and, when they find them, begin to look deeper to see if there is more extensive matching between the documents.

The system is incredibly efficient and it enables automated plagiarism checkers to look through a mind-bogglingly large amount of content for any similarities. However, the weakness of the system is that it relies on exact matches. If you can change out enough words, you can easily fool plagiarism checkers. This is precisely how synonymized or “spinning” plagiarism works.

To counter this, plagiarism checkers have routinely narrowed the holes in their net. According to iParadigms, the makers of Turnitin and iThenticate, one would have to change one out of every three words in an essay or article to be reasonably assured it wouldn’t trip their detection. In fact, according to a lecture I attended at the 3rd International Plagiarism Conference, the existing systems did a reasonable job, even in 2008.
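To make the idea concrete, here is a minimal sketch of phrase matching, assuming a toy checker that compares overlapping three-word phrases. The phrase length, the function names and the sample sentences are all assumptions made for illustration, not a description of how Turnitin or any other real product works.

```python
def phrases(text, n=3):
    """Return the set of overlapping n-word phrases in text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(suspect, source, n=3):
    """Fraction of the suspect's n-word phrases that also appear in the source."""
    suspect_phrases = phrases(suspect, n)
    if not suspect_phrases:
        return 0.0
    return len(suspect_phrases & phrases(source, n)) / len(suspect_phrases)


# Hypothetical sample texts (assumptions for illustration only).
source = "the quick brown fox jumps over the lazy dog near the river bank"
copied = "the quick brown fox jumps over the lazy dog near the river bank"
# "Spun" copy: every third word (the 2nd, 5th, 8th and 11th) has been swapped.
spun = "the fast brown fox leaps over the idle dog near a river bank"

print(overlap(copied, source))  # 1.0
print(overlap(spun, source))    # 0.0
```

In this toy setup, swapping every third word leaves no three-word phrase intact, so the spun copy shares nothing with the source. Real checkers use more forgiving matching than this, which is why the threshold for fooling them is harder to reach in practice.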

However, translated plagiarism takes the problem exposed by spinning plagiarism and magnifies it manyfold, pushing it well past what our current systems can handle.

Why Translated Plagiarism is So Difficult

There are at least three problems in trying to detect translated plagiarism, and they combine to make a near-impossible problem for automated systems to crack.

There is No One Right Way to Translate a Word: “Gato” may be Spanish for “Cat” but it also might mean “Feline”, “Tabby” and a slew of other synonyms. There is a lot of nuance to these words, but an automated system will see them as completely different.
Different Languages Have Different Grammar Structures: Though most languages in a family will have a similar structure, even subtle differences between languages make a word-for-word translation impossible. A sentence in one language often has to be completely rewritten to be correct in another.
No Effective Automatic Translation System: Bad Translator illustrates this problem pretty well. It translates text from one language to another and then back again to English, often with hilarious results. Automated translation systems work well enough to be understood, but not well enough to produce exact matches.

What this all adds up to is that, with translation, there is simply too much in the way of nuance and interpretation to be able to create an exact match out of translated plagiarism, at least with any reliability. Add to that the fact that most translated plagiarism also has some element of rewriting built into it, and there isn’t much that can be done.

A translated version of a work, once translated back, just isn’t going to match up with the original in any significant way, at least not in a way that can be detected through automatic systems.

Plagiarism detection systems could counter this by casting a wider net, accepting looser matches and trying to accept synonyms of words as exact matches. However, not only does this increase the amount of processing power needed to do detection, but it also opens up the door to false positives.

Automated plagiarism detection systems, in order to have any usefulness, need to strike a balance between catching a reasonable number of matches and not returning too many false positives. With too few actual matches, plagiarized works slip through routinely. With too many false positives, the results are useless and impossible to go through.

Casting a net wide enough to find translated plagiarism would, almost without a doubt, generate a lot of false positives, especially when dealing with several papers on the same or similar topic.
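The trade-off can be sketched with a toy "wide net" that treats synonyms as the same word and ignores word order. The synonym table, the similarity measure and the sample sentences below are all assumptions chosen for illustration, not how any real checker is built.

```python
# Hypothetical synonym table mapping each word to one canonical form.
SYNONYMS = {
    "feline": "cat", "tabby": "cat",
    "naps": "sleeps", "dozes": "sleeps",
    "couch": "sofa",
}


def word_set(text, normalize):
    """Return the set of words in text, optionally collapsing synonyms."""
    words = text.lower().split()
    return {SYNONYMS.get(w, w) for w in words} if normalize else set(words)


def similarity(a, b, normalize):
    """Shared words divided by all distinct words across both texts."""
    words_a, words_b = word_set(a, normalize), word_set(b, normalize)
    return len(words_a & words_b) / len(words_a | words_b)


source = "the cat sleeps on the sofa"
translated_copy = "the feline naps on the couch"  # hypothetical plagiarized version
independent = "my tabby dozes on our couch"       # hypothetical unrelated author

for label, text in [("copy", translated_copy), ("independent", independent)]:
    strict = similarity(source, text, normalize=False)
    loose = similarity(source, text, normalize=True)
    print(label, round(strict, 2), round(loose, 2))
```

With synonyms collapsed, the translated copy becomes a perfect match, but the independently written sentence also climbs from a score of 0.1 to more than 0.5. Any threshold loose enough to catch the copy starts to pull in honest work on the same topic.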

This means that, for the most part, automated detection of translated plagiarism will be all but impossible. Though checkers can and will get lucky from time to time, they won’t be able to do it reliably. However, the news is not all bad. In fact, the problem isn’t as great as many likely believe it to be.

The Good News

The good news in all of this is the same good news that comes out of spinning plagiarism: doing a quick job with it produces shoddy results, while doing a good job with it requires a great deal of time.

There is a reason high-quality translation services are both difficult to find and expensive to procure. Good translation is difficult, time-consuming and very specialized. The odds of a potential plagiarist being able to do a high-quality translation of a work in less time than it would have taken to simply produce an original creation with the other work cited correctly are slim.

For the most part, people who commit acts of translated plagiarism will be caught not by an automated system, but by teachers and professors who notice a change in the student’s writing or see clear errors in the work.

For this kind of plagiarism, humans are always going to be the best weapon, as we are able to spot the imperfections such plagiarism inevitably creates.

In short, the mere fact that automated systems aren’t able to easily detect translated plagiarism doesn’t mean that those who go that route will easily get away with the deed. In fact, they may even be more easily caught, as human intuition is often easier to interpret than a plagiarism report.

Bottom Line

It is highly unlikely with the current approach of automated plagiarism detection that we will be able to spot translated plagiarism reliably. The current system just isn’t geared for it and casting a net wide enough to find it would also ensnare far too many false positives to be useful.

This just goes to show that plagiarism detection systems, no matter how good and useful they are, can never be magic “catch all plagiarism” machines. We cannot simply turn over our judgment on what is and is not plagiarism to automated systems because they will always have weaknesses and problems.

Plagiarism detection always requires a human element and the automated systems are merely tools that help the humans do a better job of figuring out what is copied.

Once educators, content creators and everyone else learn that, they can start using these tools as they were intended. That, in turn, will make the plagiarism atmosphere both in academia and online a lot clearer.