Search Engine Showdown: Testing Plagiarism Detection

Jonathan Bailey, February 14, 2008

Though Google is by far the search leader in most of the world, it is not the only tool available. Other companies provide similar services, and many of them have strong followings.

However, determining which search engine is “better” for most kinds of searching is difficult. What matters is not how many results a search engine returns but how relevant and helpful those results are. That is a very subjective standard: one person could be very happy with a result while another is completely unsatisfied.

But there is one area where more is generally seen as better: plagiarism detection. When looking for copies of your own work, you want to find as many matches as possible, regardless of the order in which they are listed.

Unfortunately, that is a fairly specialized kind of searching and one that not all search engines are well equipped for. Your typical search engine user looks for simple keywords and general information, not long phrases and specific copies. As such, most search engines don’t invest heavily in improving these types of results.

But for those interested in searching for copies of their content and detecting plagiarism, this raises the question: “Which search engine is the best?” To help decide that, I put five of the top search engines through a battery of tests, and the results were surprising.

The Tests

To cover as much of the market as possible, I decided to test the four top search engines (Google, Yahoo!, MSN and Ask) as well as an upstart metasearch engine, Clusty.

I first tested their ability to detect static content that was copied. I ran a total of nine searches through each of the search engines. Each search used a statistically improbable phrase from a different work I’ve written over the years, and I measured the number of results returned.

The first three searches dealt with poetry, which used shorter phrases and involved very high amounts of plagiarism/copying. Then I tested using three short stories, which have significantly less copying but longer phrases. Finally, I tested with three shorter prose pieces, such as essays, that are average in both copying and phrase length.

When searching for the phrases, I included all supplemental results and counted different pages on the same site separately, since they are often separate infringements. I did, however, browse through the results in order to seek out and eliminate any sites that contained the phrase I searched for, but not the actual work. That turned out to be unnecessary, though, as all results were relevant.
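The counting rules above can be sketched in Python. This is only an illustration under assumptions, not the procedure actually used for these tests, which was done by hand: the function names, the example URLs and the sample phrases are all hypothetical, and "contains the actual work" is approximated by requiring a second phrase from the same work to appear on the page.

```python
from urllib.parse import urlsplit


def normalize(text):
    """Lowercase and collapse whitespace so line breaks don't hide a match."""
    return " ".join(text.lower().split())


def count_matches(results, phrase, confirm_phrase):
    """Count result pages that appear to contain the work.

    results: iterable of (url, page_text) pairs (hypothetical input shape).
    Different pages on the same site count separately; the same page
    listed twice (for example with a trailing slash) counts once.
    """
    seen = set()
    count = 0
    for url, page_text in results:
        parts = urlsplit(url)
        key = (parts.netloc.lower(), parts.path.rstrip("/"), parts.query)
        if key in seen:
            continue  # same page listed more than once
        seen.add(key)
        text = normalize(page_text)
        # Assumption: a page holding both phrases holds the work itself,
        # not just a stray quotation of the searched phrase.
        if normalize(phrase) in text and normalize(confirm_phrase) in text:
            count += 1
    return count


# Made-up sample data for illustration only.
poem = "Silver rain on\nempty streets, and the lamps went out one by one."
results = [
    ("http://a.example/poems/1", poem),
    ("http://a.example/poems/2", poem),   # second page, same site: counted
    ("http://a.example/poems/1/", poem),  # duplicate listing: skipped
    ("http://b.example/quotes", "Favorite lines: silver rain on empty streets."),
]
print(count_matches(results, "silver rain on empty streets",
                    "the lamps went out one by one"))  # prints 2
```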

After those tests, I did one final search to test the ability of the search engines to detect plagiarism of dynamic content by using my Digital Fingerprint.

These tests are not scientific and are not designed to be the be-all and end-all of search engine effectiveness in this area. Rather, this is just a quick and dirty look at how many results each engine returns for searches for my own content, designed to give an idea of which search engines might be worth considering.

Your results may vary.

Static Content

The results of the static content searches are below:

	Google	Yahoo!	MSN	Ask	Clusty	Winner
Poem 1	72	81	29	18	42	Yahoo!
Poem 2	29	25	18	10	27	Google
Poem 3	21	29	10	6	14	Yahoo!
Story 1	0	2	2	0	2	Y/M/C
Story 2	0	3	2	2	4	Clusty
Story 3	0	3	2	1	2	Yahoo!
Prose 1	8	8	5	4	10	Clusty
Prose 2	12	5	9	2	12	G/C
Prose 3	19	16	8	7	12	Google

Totals	161	172	85	50	125	Yahoo!
Rounds Won/Tied	3	4	1	0	4	Y/C

When looking at the results, several things become clear. First is that Yahoo! came out ahead: it returned the most results overall and, with four rounds won or tied, shared the lead in rounds with Clusty. Google performed very well in all of the tests, save those for the short stories. This is interesting because Google failed to even return my own site as a result on all three occasions, and no amount of tweaking the phrase (removing punctuation, shortening, etc.) seemed to help. Quick tests for other short stories on my sites had similar results, returning either just one hit or none.
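For anyone who wants to try the same kind of phrase tweaking, here is a minimal Python sketch that produces query variants: the original phrase, a version with punctuation removed, and progressively shorter versions. It assumes the search engine treats a phrase in double quotes as an exact-phrase search; the function name, the `min_words` cutoff and the sample phrase are made up for illustration.

```python
import string


def query_variants(phrase, min_words=5):
    """Return quoted exact-phrase queries, from most to least specific."""
    variants = []

    def add(text):
        quoted = f'"{text}"'
        if text and quoted not in variants:
            variants.append(quoted)

    # 1. The phrase as written, with whitespace collapsed.
    add(" ".join(phrase.split()))

    # 2. The same phrase with ASCII punctuation removed.
    stripped = phrase.translate(str.maketrans("", "", string.punctuation))
    words = stripped.split()
    add(" ".join(words))

    # 3. Shortened versions, dropping one trailing word at a time.
    for n in range(len(words) - 1, min_words - 1, -1):
        add(" ".join(words[:n]))

    return variants


# Made-up sample phrase for illustration only.
for query in query_variants("The clocks, all of them, stopped at noon today"):
    print(query)
```

Shorter variants match more pages but are also more likely to match unrelated text, so the longest variant that returns anything is usually the one to trust.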

The surprise of the test results was Clusty, which performed reasonably well in all tests, and even won or tied in more rounds than Google. Though its total number of results was significantly lower than those of both Yahoo! and Google, it established itself as a top-tier candidate in this area.

MSN and Ask, however, performed poorly all around. MSN returned fewer than half the results of Yahoo! and Ask returned fewer than one third. The best MSN could do was tie in one of the story rounds, while Ask was unable to win or tie in any of the searches.

In the end, Google and Yahoo! finished neck and neck, with Yahoo! taking a slight lead due to Google’s strange performance in the story rounds. Clusty is right behind both of them and MSN and Ask are both left in the dust.
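The totals and the ratios quoted above can be re-derived from the table with a few lines of Python. This is only a sketch: the numbers are copied from the static content table, while the names `ENGINES`, `ROUNDS`, `column_totals` and `round_leaders` are made up for illustration.

```python
ENGINES = ["Google", "Yahoo!", "MSN", "Ask", "Clusty"]

# Result counts per round, in the same column order as ENGINES.
ROUNDS = {
    "Poem 1": [72, 81, 29, 18, 42],
    "Poem 2": [29, 25, 18, 10, 27],
    "Poem 3": [21, 29, 10, 6, 14],
    "Story 1": [0, 2, 2, 0, 2],
    "Story 2": [0, 3, 2, 2, 4],
    "Story 3": [0, 3, 2, 1, 2],
    "Prose 1": [8, 8, 5, 4, 10],
    "Prose 2": [12, 5, 9, 2, 12],
    "Prose 3": [19, 16, 8, 7, 12],
}


def column_totals(rounds):
    """Sum each engine's results across all rounds."""
    return dict(zip(ENGINES, (sum(col) for col in zip(*rounds.values()))))


def round_leaders(counts):
    """Return every engine that shares the highest count in one round."""
    top = max(counts)
    return [engine for engine, n in zip(ENGINES, counts) if n == top]


totals = column_totals(ROUNDS)
for engine, total in totals.items():
    share = total / totals["Yahoo!"]
    print(f"{engine}: {total} results, {share:.0%} of Yahoo!'s total")

for name, counts in ROUNDS.items():
    print(name, "->", "/".join(round_leaders(counts)))
```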

Dynamic Content

I decided to do a similar test using my digital fingerprint, a string of eight semi-random characters, to see which search engine could find more copies of my content from this site on the Web. In light of the previous results, these were very shocking.

	Google	Yahoo!	MSN	Ask	Clusty	Winner
Digital Fingerprint	297	6	6	48	19	Google

In terms of pure count, Google blew away the competition, returning over six times as many results as its nearest competitor, Ask. With almost 300 results, Google stood tall in this division. Ask was able to locate about 48 results and Clusty managed to find 19 copies, but Yahoo! and MSN were each only able to locate six pages with my digital fingerprint.

However, this trouncing comes with a caveat. Google’s results were filled to the brim with pages that, most likely, did not belong in the search engine index at all. Aside from spam blogs and other regular sources of content theft, there were a lot of legitimate sources, such as other search engines, blog indexes and caches, whose pages should not have been indexed by Google.

It is going to take a more thorough analysis to see whether Google actually caught more scraping or is simply indexing more sources of legitimate reuse.

Though I agree more is typically better, it seems likely to me that the Google result is highly inflated, and the sites I visited seem to bear that out.
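For readers who want to try the same idea, a fingerprint of this kind can be sketched in Python. This is an illustration under assumptions, not the mechanism used for this site: the eight-character length comes from the description above, but the alphabet, the function names and the choice to append the fingerprint to the end of each post are all hypothetical.

```python
import secrets
import string

# Assumption: lowercase letters and digits are enough to make the string
# unlikely to occur anywhere else on the Web.
ALPHABET = string.ascii_lowercase + string.digits


def make_fingerprint(length=8):
    """Generate a random string to embed in every published post."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def append_fingerprint(body, fingerprint):
    """Attach the fingerprint to the end of a post (hypothetical placement)."""
    return f"{body}\n\n{fingerprint}"


def has_fingerprint(page_text, fingerprint):
    """Check whether a fetched page still carries the fingerprint."""
    return fingerprint in page_text


fingerprint = make_fingerprint()
post = append_fingerprint("Text of a blog post.", fingerprint)
print(fingerprint, has_fingerprint(post, fingerprint))
```

Because the fingerprint travels with every post, one search for that string can turn up copies of any article from the site, which is what makes it useful for dynamic content.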

“Bonus” Round

In one final test, I ran a line from Shakespeare’s “Hamlet” through the five search engines to see how many copies of the work each detected. Once again, the results were surprising.

	Google	Yahoo!	MSN	Ask	Clusty	Winner
Hamlet	15800	27000	7330	2030	2010	Yahoo!

This time around, the results more closely mirrored the first round of testing, but with Yahoo! holding a much larger advantage. The other three were left in the dust, with Clusty, this time, bringing up the rear.

It is important to note though that, with these kinds of numbers, the results counters are notoriously unreliable and this test was not intended to be serious in any way. It was merely a curiosity that drove me to see how one of the most copied works in history was being used on the Web.

The one thing it did prove conclusively is that, no matter what, I am no Shakespeare.

Caveats

As mentioned earlier, these tests were not designed to be scientific, but rather, were simply “quick and dirty” checks to see how the different search engines handled the same queries. Different people will likely get different results and the conclusions that one can draw from this are, admittedly, very narrow.

Any researchers interested in replicating my study can contact me and I’ll send them the exact queries used and works checked for. I don’t wish to post them here as doing so could further skew future tests should this post be scraped.

All in all, the goal of this test is not to set any rules about who to trust with your plagiarism searches, but rather, where to start looking.

Conclusions

Personally, looking at these test results, I would say that both Google and Yahoo! are acceptable solutions for detecting static content. However, since Google offers Google Alerts, an email service that automates most search functions, it is safe to say that, for most, it will remain the default for plagiarism searching, even though Yahoo! produced slightly more results.

However, there is obviously a benefit to broadening your search horizons and incorporating both Yahoo! and Clusty into your efforts. Both sites produced better results than Google on many tests and often caught sites that Google missed.

Another thing that becomes clear is that MSN and Ask, though likely great search engines in other regards, both lag far behind in this area. In none of my searches did I notice Ask or MSN picking up copies that were clearly missed by the other sites. They might be worth experimenting with further, but are likely not the best places to start.

In the end, these tests seem to have achieved their goals by giving us an idea of where to begin and what services we should likely avoid. Your mileage may vary, but these results do seem to speak for themselves.
