Yesterday, a concerned mother approached me, worried that a poem being submitted to a local poetry contest was less than original. I offered some basic advice on how to search for the work in Google but, despite her best efforts, no results came up.
I then had her forward the poem to me and began using my own tools to analyze it. After a few futile Google searches of my own, I decided to give Copyscape a try before moving on to more advanced tools.
However, further searching proved unnecessary. Copyscape detected a match, and it immediately became clear why the Google searches were failing: the poem in question was not a verbatim plagiarism but a derivative with more than a few modifications.
The episode highlighted the point that Copyscape, while perhaps not ideal for verbatim plagiarism, is best at detecting derivative works and performing the kind of thorough, line-by-line checking that simply is not practical for humans.
Why It Works
Humans and computers search for duplicates in much the same way: by punching strings of words into a search engine and comparing the sites that come up to the original. The difference, however, is that Copyscape and similar systems do it much faster and process many more strings.
If someone takes your work and rewrites it, producing a derivative of the work, they are going to keep at least some of the original wording. Unless the plagiarist changed at least one in every five or six words, that means there is searchable text to match in the derivative, but finding it by hand is nearly impossible.
The problem is that the matching strings may cross punctuation or even paragraph breaks. Humans, in order to use hand search techniques, need to first remove all of the punctuation and then search for a string beginning with every word in the piece.
That could take hours, or even days, especially when you have to filter out false positives.
Computers, however, typically ignore such punctuation to start with and search for a large number of strings rapidly. If there is a match, they hold it and see if it matches for other strings as well. This works for pretty much any copy detection system including Copyscape, Turnitin and BitScan.
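To make the idea concrete, here is a minimal sketch of that kind of string matching. It is not how Copyscape or any other service actually works (their methods are not public); the function names, the four-word string length and the two sample texts are all my own assumptions for illustration. It strips punctuation, builds a search string starting at every word, and reports the strings the two texts share.

```python
import re

def words(text):
    """Lowercase the text and keep only word tokens, dropping punctuation and line breaks."""
    return re.findall(r"[a-z0-9']+", text.lower())

def shingles(text, size=4):
    """Return every run of `size` consecutive words, one run starting at each word."""
    tokens = words(text)
    return {" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}

def shared_strings(original, suspect, size=4):
    """Return the word strings of length `size` that appear in both texts."""
    return shingles(original, size) & shingles(suspect, size)

# Hypothetical sample texts: the second is a light rewrite of the first.
original = "The quiet river runs beneath the moon, and carries all my sorrow to the sea."
suspect = "A silver river runs beneath the moon. And carries every sorrow to the sea!"

matches = shared_strings(original, suspect, size=4)
for phrase in sorted(matches):
    print(phrase)
```

Note that one of the shared strings, "beneath the moon and", runs straight across the punctuation that differs between the two versions, which is exactly the kind of match that is tedious to find by hand.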
Though the technique can detect verbatim plagiarism, it works best in situations where the amount of work copied word for word is low or moderate. This is not just where these tools provide a unique and valuable service, but where they perform the best.
The Problem with Verbatim Detection
The reason these services generally do a poor job at detecting widespread verbatim plagiarism is the nature of the searches.
If you search for a string of words in Google, the search engine will spit out all pages that match that text. If you choose a good phrase, all of those sites should have content lifted from your work.
However, with a Copyscape search, the computer has to process hundreds of search results, parse them and decide which sites are matches. How such services process these results is a closely held secret, but it’s safe to say that if the bar for determining a match is set too low, there will be a large number of false positives; if it is set too high, many actual matches will be discarded.
If you’re using such a system to detect widespread verbatim plagiarism, some of the matches are going to be discarded. This is because the search results received are not predictable and, depending on how the backend handles the process, it could discard some results it is unsure of in a bid to prevent false positives.
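The tradeoff can be illustrated with a toy example. Since the real scoring methods are secret, everything here is assumed: the overlap scores, the site names and the threshold values are hypothetical and chosen only to show how moving the bar changes what gets reported.

```python
def filter_matches(scores, threshold):
    """Keep only the candidate pages whose overlap score meets the threshold."""
    return sorted(site for site, score in scores.items() if score >= threshold)

# Hypothetical overlap scores: the fraction of the original's word strings
# found on each candidate page. These numbers are invented for illustration.
candidate_scores = {
    "verbatim-copy.example": 0.95,
    "light-rewrite.example": 0.40,
    "heavy-rewrite.example": 0.12,
    "common-phrases-only.example": 0.05,  # unrelated page sharing stock phrases
}

lenient = filter_matches(candidate_scores, threshold=0.03)  # bar set too low
strict = filter_matches(candidate_scores, threshold=0.50)   # bar set too high

print("Lenient:", lenient)
print("Strict:", strict)
```

With the low bar, the unrelated page comes back as a false positive. With the high bar, both rewrites are thrown out even though they are real matches.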
That isn’t to say there aren’t times when using such an automated system is preferable. There are many situations where it is practically necessary.
There are some types of work that are almost destined to be turned into unwanted derivative works.
Academic works are the most obvious target. Plagiarism in student papers, research reports and even academic journals is widespread, and the vast majority of it involves at least some rewriting.
When you consider that, it is no shock that academics have had systems such as Turnitin for many years now, as they are solutions targeted directly at that problem. In fact, many of the solutions we use today are based, at least in spirit, upon these tools.
Another category of work likely to be rewritten as it is being plagiarized is marketing and promotional copy. In the fields of advertising and PR, effective copy is hard to write and similarities are expected. On the Web, however, it often goes beyond taking ideas and into the realm of copyright infringement. Even so, since originality is still prized, both by humans and search engines, the copying is generally not verbatim.
Finally, artistic works, especially poetry and short stories, are particularly vulnerable to unwanted derivatives. Sometimes it is simple to detect, as when the plagiarist changes a character’s name or substitutes a few words in a poem. Other times it can involve massive edits to the work, leaving it almost completely different from the original.
If you work heavily in any one of these categories, it is probably worth your while to run occasional searches with a system such as Copyscape to help detect derivative use of your work.
The systems may not be perfect, but they are certainly preferable to doing the checks by hand.
Search engines were never designed to detect duplicate content. They were created to locate keywords and search terms within sites and determine which of those sites deserves to be ranked the highest.
As such, the tools we use to detect copies of our work have to push the search engines to their limit and twist their features to generate unintended results.
Fortunately, new tools are being developed from the ground up to index and monitor content across the Web. As these tools improve and grow, they will likely be able to replace these hacks and tweaks for those seeking to search out their own content.
When that happens, not only will the search engines be able to better focus on what they do best, but we will have much better data about who is using our work and how it is affecting us.
The push in this area has really just begun and it will be an area of radical change in the coming months and years.
This is the first in a short series on search engines and plagiarism detection. Next time, I’m going to do a head-to-head comparison of five of the top search engines to see which works best at detecting verbatim copies of your work.