It has become all the rage in recent months for programmers to build or revamp plagiarism checkers using Google and other search engines. Most of these plagiarism checkers, such as the “Dustball” checker, fail to produce adequate results.
The problem is that phrase selection is not a simple task. It can be difficult for human beings to determine what phrases or sentences to search for, let alone a simple algorithm. As a result, such simplistic plagiarism checkers often either miss a large number of results by choosing phrases that don’t work well with the search engines or produce a slew of false positives by selecting terms that are too common or too short.
Though a cursory search confirmed many of my original suspicions, it also showed that the plagiarism checker isn’t quite as useless as many of its brethren. It has its flaws and certainly isn’t as useful as its marketing might say, but it does have some interesting features and potentially compelling uses.
How it Works
The 10DA checker works like many similar services. Users copy an article or piece of content that they want to check for plagiarism, choose up to three services to use and hit the submit button.
The service then selects five or six snippets from the work and runs them through each of the search engines the user selected. When it’s done, it links to each of the results pages, and the user can go through them to see if there are any suspicious matches.
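As a rough illustration of that flow, here is a minimal Python sketch that pulls a handful of snippets from a text and turns each one into an exact-phrase search link. This is not 10DA's actual code. The function names, the evenly spaced sentence sampling and the engine URL templates are all assumptions made for the example.

```python
import re
from urllib.parse import quote_plus

# Assumed URL templates, for illustration only.
ENGINES = {
    "google": "https://www.google.com/search?q={query}",
    "yahoo": "https://search.yahoo.com/search?p={query}",
}

def split_sentences(text):
    """Naive splitter: break on whitespace that follows ., ! or ?"""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def pick_snippets(text, count=5):
    """Pick up to `count` sentences spread evenly through the text."""
    sentences = split_sentences(text)
    if len(sentences) <= count:
        return sentences
    step = len(sentences) / count
    return [sentences[int(i * step)] for i in range(count)]

def build_queries(text, engines, count=5):
    """Return {engine: [url, ...]}, one quoted-phrase query per snippet."""
    snippets = pick_snippets(text, count)
    return {
        name: [ENGINES[name].format(query=quote_plus(f'"{s}"')) for s in snippets]
        for name in engines
    }
```

Calling `build_queries(article_text, ["google", "yahoo"])` returns one list of links per engine. That is essentially everything the user receives in this model, since reading the results pages is still left to them.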
The site keeps track of which results pages you have already visited, turning those numbers to black, and also lets you recheck the article with a different set of search engines.
The real benefit of this system is that it is extremely simple to use and free. Since the product is in beta, anyone can paste text in and run it through the system.
The service also stands out somewhat in that it allows users to run the search through different search engines, unlike others that focus solely on Google. With the 10DA checker, you can easily search MSN and Yahoo! as well as Bloglines and more. Though many of the choices seem superfluous, especially the multiple Google services normally covered underneath the main search (such as searching either Wikipedia or Google Knol), the addition of extra choices is an interesting one.
That being said, it isn’t the first checker to offer this service. Others have been doing so for some time.
Though it is unclear how much benefit one gets from running the same article through three different engines, it is easy to see how those eager to be extremely thorough may be tempted by that feature.
Where the 10DA checker struggles the most is in the value that it adds, or lack thereof. Where Copyscape compiles the results from the various Google queries it makes and displays them in a simple results page, 10DA requires users to click through to each individual results page and do the actual legwork themselves. At this time, the 10DA checker does not even provide an indication of the number of matches in the specific results pages.
Due to this, the results that one gets from the 10DA checker could be easily replicated by going to the individual search engines and doing the searches for yourself. The 10DA checker does not even automatically select to view similar matches, meaning that the initial display only includes one or two copies of the work in question.
With no match highlighting, organization or other input from the system, essentially it is the same as performing 5-18 individual searches at once. Since only one search is usually all that is necessary to prove that a work is plagiarized, one has to wonder how useful this really is.
As with most plagiarism checkers I review, I ran the site through a short series of tests to see how the results compared with stock Google searches. Since the system still primarily uses Google, this would be a true “apples to apples” comparison.
The first test involved a prose work of mine that I know has been plagiarized many times before. I ran it through the 10DA checker and the best result of the six phrases checked in Google was 26 matches. However, after tinkering with the search term, namely by shortening it and removing punctuation, I was able to improve it to 31 results.
The reason for this is that the phrases the 10DA checker chooses seem, to me, to be extremely long. Where I can usually find a good statistically improbable phrase between 7 and 9 words long, all of the phrases chosen by the 10DA checker were over a dozen words, and some ran as long as 19.
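The kind of manual tweak described above, shortening the phrase and removing punctuation, can be sketched in a few lines of Python. The function name, the 9-word default and the decision to keep in-word apostrophes (after normalising curly ones) are assumptions for the example, not a description of how any checker actually works.

```python
import re

def clean_phrase(phrase, max_words=9):
    """Strip punctuation and keep at most `max_words` words."""
    # Assumption: curly apostrophes become straight ones so that
    # contractions such as "don't" survive as a single word.
    phrase = phrase.replace("\u2019", "'")
    # Keep runs of letters/digits, allowing apostrophes inside a word.
    words = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", phrase)
    return " ".join(words[:max_words])
```

The cleaned string can then be wrapped in quotes and searched as an exact phrase.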
Though the longer strings do reduce false positives, choosing a good unique phrase is more important in that regard. This is something that the 10DA checker struggled with as some of the results had only one match, indicating that the phrase selected was of poor quality.
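One crude way to approximate "a good unique phrase" in code is to slide a fixed-size window over the text and keep the window with the fewest everyday words. The sketch below is only a heuristic under stated assumptions: the tiny `COMMON_WORDS` list, the 8-word window and the scoring rule are invented for illustration, and a real checker would need proper word-frequency data.

```python
# Assumed, deliberately tiny list of everyday words.
COMMON_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
    "that", "for", "on", "with", "as", "was", "i", "my",
}

def uncommon_count(window):
    """Count the words in `window` that are not in COMMON_WORDS."""
    return sum(
        1 for w in window
        if w.strip(".,;:!?\"'").lower() not in COMMON_WORDS
    )

def best_window(text, size=8):
    """Return the `size`-word run with the most uncommon words."""
    words = text.split()
    if len(words) <= size:
        return " ".join(words)
    starts = range(len(words) - size + 1)
    # max() keeps the earliest window when several tie.
    best = max(starts, key=lambda i: uncommon_count(words[i:i + size]))
    return " ".join(words[best:best + size])
```

A phrase chosen this way is more likely to be distinctive, though it still says nothing about how the phrase will behave in a particular search engine.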
I also quickly tested the checker with a poem that I knew to be heavily plagiarized. However, many of the searches, due to an issue with apostrophes, came back as false negatives. Of those that did return matches, the highest had 25 results but, once again, by tweaking the search term, I was able to increase that number to 28. However, using my own phrase, I was able to find several hundred results.
My Phrase Results:
(Note: The high number of results from my phrase is likely due in large part to matches on the same domain. However, in a cursory check of the first few pages of results, I did see at least some positive matches that were not in the first two.)
The end result is that most people will find it pretty trivial to get better results than the 10DA checker. If they can look at the phrase selected, remove punctuation and pull out a good section of unique content, they can increase the effectiveness of the search.
However, why one would do that is a bit of a mystery. If you’re going through all of these motions and need the added matches that come from a better phrase, you’re probably going to find it faster and easier just to pull the phrase yourself directly from the content and then perform your own search.
Even though the site’s marketing material says that it is both a competitor and a complement to Copyscape, Copyscape is by far the more useful service. Though 10DA seems to be about on par with Copyscape in the number of matches it catches, the usability of Copyscape is much higher and well worth the five cents per search in most cases.
Still, if you’re looking to do a quick plagiarism check of an article before you post it on your site, something my wife has to do as her company’s blog editor, it might be a useful service. If you don’t feel like setting up a Copyscape account or don’t mind the extra step of visiting the results, then it could be useful.
However, I cannot recommend this service for checking for duplicate content of your site’s material. You can get more accurate matches by hand, and the amount of energy that is saved by using the 10DA checker is pretty minimal. Even the free version of Copyscape provides good matching and much higher usability.
But even that seems somewhat defeatist. With Fairshare bringing professional-grade matching technology and automatic updates to bloggers, there is no reason that bloggers or other RSS providers should be punching in their articles by hand to check for plagiarism.
In short, the age of copying and pasting textual content to see where it has appeared on the Web is fast ending. That is good news, though, as the easier to use and more automated the systems become, the more likely bloggers and other writers are to use them.
Hopefully, similar systems for images, audio and video are also fast coming.