Review: The Plagiarism Checker
By Jonathan Bailey • Dec 16th, 2008 • Category: Articles, Products
Late last week, a post reached the front page of Reddit that piqued the curiosity of copyright holders, teachers and professors alike. It was about a service called “The Plagiarism Checker” (dubbed by me the “Dustball” checker due to its domain), created by Brian Klug in 2002, when he was a student at the University of Maryland at College Park, and abandoned until recently.
The site, according to Klug, was getting about 2,000 visits per day when it was forgotten but is almost certainly doing much better now that it has taken off, attracting countless tweets and other social news attention. Librarians and teachers are especially captivated by this site.
But is “The Plagiarism Checker” worth using? Is it as powerful a tool as some, though not the site itself, have made it out to be? The sad answer is no, but it could, with a few simple tweaks, become a much more useful service for teachers and bloggers alike.
How it Works
The basic premise of the minimalist site can be summed up by its instructions:
Cut & paste your students paper or homework assignment into the box below, and click the “check” button. This free plagiarism detector will find plagiarized text in homework and other essays/reports.
In short, you take an essay, article or other lengthy prose work, paste it into a textbox and hit “check”. From there, the site extracts several strings of text, runs them through Google and compiles the results, determining whether plagiarism is probable.
In that regard, the idea is actually very similar to Copyscape, which also uses Google, via its API, to process results. However, where Copyscape keeps the “magic” hidden from the user, the “Dustball” plagiarism checker includes links to the Google results, encouraging users to click through and research the case for themselves.
That alone is a big part of the problem Webmasters, and many teachers, will have with the service. Where Copyscape, as well as academic tools such as TurnItIn, provide very simple and colorful results, The Plagiarism Checker takes a very bare-bones approach, requiring the user to perform a large amount of research on their own.
Still, a bit of research would be welcome if the service produced great results. Unfortunately, it seems that the service's performance is lukewarm at best.
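The process described in this section can be sketched roughly as follows. This is only an illustration of the general idea, not the site's actual code: the function names, the evenly spaced sampling and the default phrase length are all assumptions, and the search step is passed in as a hypothetical `search` callable rather than tied to any real search API.

```python
import re
from typing import Callable, Dict, List


def extract_phrases(text: str, words_per_phrase: int = 8,
                    max_phrases: int = 5) -> List[str]:
    """Sample evenly spaced runs of consecutive words from the text.

    Assumption: the real service's sampling strategy is unknown;
    even spacing is used here purely for illustration.
    """
    words = re.findall(r"\S+", text)
    if len(words) < words_per_phrase:
        return []  # too short to sample, much like the poem in the tests
    last_start = len(words) - words_per_phrase
    count = min(max_phrases, last_start + 1)
    if count == 1:
        starts = [0]
    else:
        starts = [i * last_start // (count - 1) for i in range(count)]
    return [" ".join(words[s:s + words_per_phrase]) for s in starts]


def check_text(text: str, search: Callable[[str], List[str]],
               words_per_phrase: int = 8, max_phrases: int = 5) -> Dict:
    """Run each sampled phrase through `search` as an exact-phrase query.

    `search` is a hypothetical callable that takes a query string and
    returns a list of matching URLs.
    """
    phrases = extract_phrases(text, words_per_phrase, max_phrases)
    hits = {p: search(f'"{p}"') for p in phrases}
    matched = [p for p in phrases if hits[p]]
    ratio = len(matched) / len(phrases) if phrases else 0.0
    return {"phrases": phrases, "hits": hits, "match_ratio": ratio}
```

Returning the per-phrase hits alongside the overall ratio mirrors the way the site exposes its search links, leaving the user to click through and judge each match.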
My Tests
To test the service, I decided to run it through a battery of tests similar to the one I ran Copyscape through before watching that service improve upon its initial results.
The first test was to run an old poem of mine through the system, one that allegedly has over 300 matches in Google. However, that test was thwarted as The Plagiarism Checker refused to even look at the work, saying that it could not function with such short text strings.
I then shifted gears and started using prose works, the first being one that had 36 matches in Google at the time I did the search. The result was stunning.
Despite the fact that Google had reported three dozen matches on test snippets from the work itself, the “Dustball” checker was unable to find anything. To make matters worse, using some of the sample quotes from the test, such as the first one, I was able to locate other copies of the work.
Clearly, The Plagiarism Checker was missing results that Google was finding, meaning it was discarding them for whatever reason.
A similar test for another prose work only returned one sentence that was matched against anything and the results for it were all false positives. This work, in Google, has six results.
The only search using the service that seemed to work remotely well was when I ran the Declaration of Independence through it. Every search term, in this test, came back positive.
It appears that text that is not widely distributed around the Web may or may not show up as plagiarized in this service, something that has me very worried as many are starting to rely on this plagiarism checker as their main tool for detecting both copyright infringement and the plagiarism of students.
The Sad Truth
Simply put, any and all of these searches should have come back as plagiarized. Even if there were no other matches of the content, these works exist on my site and are available through Google there. There is no reason that any of these works should have come back as anything short of 100% plagiarized, since this site cannot know I was the one submitting them.
For teachers, this is not good news. If a student plagiarizes material from obscure sources, they are likely to escape detection. Likewise, Webmasters and others who might want to use this tool to track their own content will likely be disappointed that it doesn’t seem to pick up infringement when it involves only a few dozen sites.
This can most likely be fixed through tweaks in the algorithm, but as it sits right now, it doesn’t appear that it has much to offer teachers or Webmasters, especially when Copyscape is relatively effective and cheap to use.
Simply put, at this moment, Copyscape is easier, more effective and faster than The Plagiarism Checker and, at only five cents a search, is affordable too.
However, the best technique still appears to be taking the time to select good phrases from a work and manually searching for those. It returns the most results and seems to work well nearly all of the time.
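For readers who want to make that manual approach a little less tedious, here is a minimal sketch that turns a hand-picked phrase into an exact-phrase Google search link. The function name and the optional `exclude_site` parameter are my own illustrative choices; the quoting and the `-site:` operator are ordinary Google search syntax, and `example.com` below stands in for your own domain.

```python
from urllib.parse import quote_plus


def exact_phrase_url(phrase: str, exclude_site: str = "") -> str:
    """Build a Google search URL for an exact-phrase match.

    Wrapping the phrase in double quotes asks for an exact match;
    `exclude_site` (hypothetical parameter) adds a -site: operator so
    your own copy of the work is left out of the results.
    """
    query = f'"{phrase.strip()}"'
    if exclude_site:
        query += f" -site:{exclude_site}"
    return "https://www.google.com/search?q=" + quote_plus(query)


# Usage: choose a distinctive phrase from the work, then open the link.
link = exact_phrase_url("endowed by their Creator", exclude_site="example.com")
```

Picking the phrases is still the part that takes judgment; distinctive wording from the middle of a work tends to be more useful than common expressions.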
The Big Picture
My issue with The Plagiarism Checker has less to do with the service itself and more to do with how others have been promoting it. The site itself is actually fairly humble about what it can do, but bloggers and Twitter users have been advertising it as if it were a silver bullet to detect plagiarism. Clearly, that is not the case.
With a few tweaks and fixes to the algorithm, I don’t doubt that this service, much like Copyscape, could become a very powerful tool. However, even if the results were on par with Copyscape, the latter remains faster and easier to use, meaning that there will not be much reason to use the “Dustball” checker.
To make matters worse, most teachers and professors have access to services such as TurnItIn that are far more accurate and cover a much larger breadth of sources than “The Plagiarism Checker”. Considering the ease of use and added features, there is not much that can be gleaned from a Google-only search that can’t be gleaned from the more automated service (though Copyscape did top Turnitin in a recent plagiarism detection study).
In short, I don’t see much usefulness for this tool, even if its accuracy improves, and I am more than a little confused as to why so many seem to have promoted it so heavily.
Conclusions
More than anything, this is a case against the reliance on any one plagiarism checking service. Even the best services will let results slip through the cracks. Furthermore, just because a service is popular does not mean that it should be trusted above all.
However, I find it very difficult to fault The Plagiarism Checker for this confusion and these problems. It is clear that the service was as much an experiment as anything. It is promoted humbly and was actually abandoned for approximately six years. It was others, perhaps desperate for some way to more effectively detect plagiarism, who gave it an unjustified reputation.
If anything, this case shows the need and the potential market for such services and illustrates why some companies have made millions in this field. People are eager for a solution and are excited by any promise of one.
Sadly though, this site is not the one people are looking for.
Jonathan Bailey is The Webmaster and author of Plagiarism Today, which he founded in 2005 as a way to help Webmasters going through content theft problems get accurate information and stay up to date on the rapidly-changing field. He is also a consultant to Webmasters and companies to help them devise practical content protection strategies and develop good copyright policies.
Hi Jonathan,
Just ran a few of our tests through. A 100% plagiarism raised two concerns (both correct), but only because this 52 or so letter slice did not contain any Umlauts….
When you do get a hit, you still have to leaf through the possible source to look for the place the plagiarism was taken from. Copyscape does a nicer job of markup.
It does not find multiple sources and misses some of the more esoteric ones completely. Not a good workflow fit, and searching with Google is faster.
Agreed completely on all fronts. I can see what the creator was trying to do but between the lackluster matching and the poor workflow, there isn't much here for anyone to sink their teeth into.
I came to the same conclusions when I sampled it. Thank you for this detailed write-up. I am always looking for great plagiarism tools that don’t violate FERPA (Turnitin, IMHO).
I had suspected it from the moment I laid eyes on it but still wanted to test it out thoroughly before I tossed it aside. I'd hate to pass over the greatest plagiarism checking tool just because of an intuition. Sadly, it appears I saw right.
I applied for copyright certificates for 3 of my works (greeting cards). The first work was submitted to the US Copyright Office on December 10, 2007. The second and third were submitted at the end of May, 2008. It normally takes 4 months to issue a copyright certificate. However, I still haven't gotten any response. What do I do? Any suggestions? Is this the way a federal agency is supposed to operate? I mean, just because the office receives 10,000 works (in one day) from copyright applicants, that does not grant the office the right to mess me up! Does it? Anyone else have a similar situation?
The best thing that I can tell you to do is to call them and ask them what is going on. They have their numbers here: http://www.copyright.gov/help/
I'm the first to agree that the USCO is a bag of hurt and doesn't function even remotely as it should, but I sadly can't help shed any light on this matter. The few times I've done it I've only had one major issue and that was caused by their radiation equipment destroying CDs I sent in.
My advice is to get in touch with them either via email or phone and see what they can do. They do respond, at least the few times I've contacted them.
The article describes calendar anomalies, which are cyclical anomalies in returns where the cycle is based on the calendar. The most important calendar anomalies are the January effect, the turn-of-the-month effect and the weekend effect. The body of studies finding evidence of different market anomalies is too overwhelming to simply ignore and write off as temporary mispricings according to efficient market theory.
The analysis shows that investors have to be critical and not overinterpret results, at the risk of neglecting and underestimating the importance of such a basic concept as portfolio diversification.
This journal shows that the significance of calendar trading rules is much weaker when it is assessed in the context of global rules that could reasonably have been evaluated. The evidence provided shows that daily abnormal returns in January have large means relative to the remaining eleven months, and that small firms experience large returns in January and exceptionally large returns during the first few trading days of January. The effect is also closely associated with the tax-loss selling induced by negative returns over the previous years.
The weekend effect, by contrast, refers to the tendency of stocks to exhibit relatively large returns on Fridays compared to those on Mondays. The Monday effect is associated with regularities in the trading patterns of individual and institutional investors related to the day of the week. We find a relative increase in trading activity by individuals on Mondays. In addition, there is no tendency to increase the number of sell transactions relative to buy transactions, which might be due to information that individuals collect over the weekend.
Last is the month-end effect: the existence of positive returns only in the first half of the month, and more specifically particularly high returns on the last day of one month and the first three days of the next. This is primarily due to higher month-end cash flows such as salaries, dividends and interest payments.
The journal acknowledges that short-term anomalies might exist but holds that over a longer horizon they will be cancelled out, so that the market can go back to being perfectly efficient. There is no guarantee that markets will be perfectly efficient in the short run. However, an investor specialized in technical analysis, which uses past patterns of price and trading volume as the basis for predicting future prices, will not be able to earn any abnormal returns from detecting anomalies and arbitrage opportunities, due to the irregular nature of these anomalies. The random-walk evidence suggests that prices of securities are affected by news. Favorable news will push up the price and vice versa. It is therefore appropriate to question the value of technical analysis as a means of choosing security investments.
Fundamental analysis involves using market information to determine the essential value of securities in order to identify those securities that are undervalued. However, semi-strong-form market efficiency suggests that fundamental analysis cannot be used to outperform the market. In an efficient market, equity research and valuation would be a costly task that provided no benefits. The odds of finding an undervalued stock should be random (50/50). Most of the time, the benefits from information collection and equity research would not cover the costs of doing the research.
For an optimal investment strategy, investors are advised to follow a passive investment strategy, which makes no attempt to beat the market. Investors should not select securities randomly according to their risk aversion or their tax positions. This does not mean that there is no portfolio management. In an efficient market, it would be a superior strategy to diversify randomly across securities, carrying little or no information cost and minimal execution costs in order to optimize returns.
The basic question related to market anomalies is whether an identified anomaly is evidence of a stable, long-run phenomenon on which an investment strategy could be based or whether it is just, as the name suggests, a short-term, unique mispricing.