Plagiarism Detection Showdown: Bing vs. Yahoo! vs. Google

Jonathan Bailey, June 3, 2009

With the launch of Microsoft’s new Bing search engine, there has been a lot of talk about which site produces the best results. Though Google is far and away the king in terms of popularity, Yahoo! still holds onto a sizeable minority and, with Microsoft’s new offering, there is a promise of a three-way horse race.

But for readers of this site, one of the critical questions when choosing a search engine is which does the best job of finding duplicated content? Whether one is looking to verify that a work is original or track down plagiarism of their own work, this is an important question.

So I decided to take the three search engines and put them head to head in what I am calling the “5 Round Search Engine Deathmatch”. The goal is to find which search engine performs the best at detecting plagiarism and, if possible, find the strengths and weaknesses of the three.

Without any further ado, here is how it went down.

The Test

The goal of the test is not to be exhaustive, but to provide a quick overview of the various capabilities of the three search engines in this area. To do that, I chose five different works and ran unique phrases of those works through the search engines.
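For readers who want to repeat this kind of check, here is a minimal sketch of how an exact-phrase query could be built for each engine. The search URL patterns and the placeholder phrase are my assumptions for illustration, not something recorded from the test itself, and the engines may change their URL formats at any time.

```python
from urllib.parse import urlencode

# Assumed search URL patterns (not taken from the test above); each entry
# is (base URL, name of the query parameter).
ENGINES = {
    "Google": ("https://www.google.com/search", "q"),
    "Bing": ("https://www.bing.com/search", "q"),
    "Yahoo!": ("https://search.yahoo.com/search", "p"),
}

def phrase_query_url(engine, phrase):
    """Build a search URL that asks for the phrase as an exact match."""
    base, param = ENGINES[engine]
    # Wrapping the phrase in double quotes requests an exact-phrase search.
    return base + "?" + urlencode({param: '"' + phrase + '"'})

# Placeholder phrase; in practice, pick a distinctive line from your own work.
for name in ENGINES:
    print(name, phrase_query_url(name, "an unusual phrase from your work"))
```

The quoted phrase is the important part; the rest is just URL encoding.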

The first two works were poems of mine that had seen widespread reuse but had limited enough copying to be useful. The next two were prose works, both with very limited reuse. The last was the Declaration of Independence, to see how many matches the search engines reported for that.

With each set of results, I went through by hand and counted the number of clear duplicate entries (matches where it was the same content, but at a different URL) and created a new total. The results of the five rounds are below.
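The counting was done by hand, but the bookkeeping can be sketched in a few lines. In this rough illustration, `Result`, `content_id`, and the sample URLs are all hypothetical; `content_id` stands in for my own judgment that two URLs show the same copy.

```python
from dataclasses import dataclass

@dataclass
class Result:
    url: str
    content_id: str  # hypothetical label: same value means "same copy, different URL"

def tally(results):
    """Return (initial results, duplicates, final total) for one result set."""
    seen = set()
    duplicates = 0
    for result in results:
        if result.content_id in seen:
            duplicates += 1  # same content already counted at another URL
        else:
            seen.add(result.content_id)
    initial = len(results)
    return initial, duplicates, initial - duplicates

# Made-up sample data, not taken from the actual test.
sample = [
    Result("http://example.com/poem", "copy-1"),
    Result("http://example.com/poem?print=1", "copy-1"),
    Result("http://example.org/stolen", "copy-2"),
]
print(tally(sample))  # (3, 1, 2)
```

The final total in each table below is simply the initial results minus the duplicates.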

Round 1

For this round a poem with widespread copying was used.

Search Engine	Initial Results	Duplicates	Final Total
Bing	13	2	11
Google	47	17	30
Yahoo!	57	3	54

The first test was a big win for Yahoo!. It not only found more results initially but also had fewer duplicates. It cast both a broader and a more fine-tuned net. However, Google’s results were still respectable, catching most of the worst infringement despite having little more than half of Yahoo!’s results.

Bing, on the other hand, brings up the rear. With barely a fifth of Yahoo!’s and a third of Google’s results, it didn’t perform up to any standard.

Round Winner: Yahoo! for both better detection and better duplicate filtering.

Round 2

For this round, another widely-copied poem was used.

Search Engine	Initial Results	Duplicates	Final Total
Bing	6	2	4
Google	86	6	80
Yahoo!	43	3	40

In this round Google turned the tables, doubling Yahoo!’s efforts. It found 80 non-duplicate results to Yahoo!’s 40. However, both had comparable duplicate filtering, both with approximately 7% duplicates. Not a bad percentage.

Bing, on the other hand, only found four results, just 5% of Google’s results and 10% of Yahoo!’s. It also had a much higher duplication problem with 33% of its results being duplicates.
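The percentages for this round can be checked with a short calculation. The figures come straight from the Round 2 table; the helper name `duplicate_rate` is just for illustration.

```python
# (initial results, duplicates) from the Round 2 table.
round2 = {"Bing": (6, 2), "Google": (86, 6), "Yahoo!": (43, 3)}

def duplicate_rate(initial, duplicates):
    """Share of the initial results that were duplicates."""
    return duplicates / initial if initial else 0.0

finals = {name: initial - dups for name, (initial, dups) in round2.items()}

for name, (initial, dups) in round2.items():
    print(f"{name}: final total {finals[name]}, "
          f"duplicate rate {duplicate_rate(initial, dups):.0%}")

# Bing's final total as a share of the other two engines' final totals.
print(f"Bing vs. Google: {finals['Bing'] / finals['Google']:.0%}")
print(f"Bing vs. Yahoo!: {finals['Bing'] / finals['Yahoo!']:.0%}")
```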

Round Winner: Google. Though it and Yahoo! had identical duplicate filtering, Google produced twice the results.

Round 3

For this round, a prose piece with limited reuse was used.

Search Engine	Initial Results	Duplicates	Final Total
Bing	0	0	0
Google	6	0	6
Yahoo!	4	1	3

Google has another solid round, with 6 copies found and no duplicates. Yahoo!, on the other hand, only found four sites and one was a repeat. This is a clear case of Google finding more copies of a work, three more to be specific.

Bing, on the other hand, found nothing. It didn’t even find the original site, which would have been good for one point. In short, Bing flat out failed in this test.

Round Winner: Google for both finding more matches and providing more accurate results.

Round 4

For this round, a prose work with a modest amount of known copying was used.

Search Engine	Initial Results	Duplicates	Final Total
Bing	2	1	1
Google	27	22	5
Yahoo!	2	1	1

The results of this test were interesting. Google picked up the most results but its initial find of 27 results was complicated by an enormous duplication issue. A full 22 of the results were obvious duplicates, including over a dozen results pertaining to the same forum posting. This pushed the real infringements to the last page of the results.

Yahoo! and Bing both found two results, both on my site. One, in both cases, was a duplicate.

Round Winner: No one. I thought long and hard about this one and have decided that no one deserves to win this round. Google found infringements but the results were so buried as to be useless. Yahoo! and Bing failed to find anything beyond my own site.

Round 5

For this round, a line from the opening of the Declaration of Independence was used.

Search Engine	Initial Results	Duplicates	Final Total
Bing	4,800,000	N/A	N/A
Google	53,700	N/A	N/A
Yahoo!	118,000	N/A	N/A

For these results, due to their length, I am forced to take the search engine estimations at face value. However, I am having a very hard time believing Bing’s results.

This would indicate that Bing, the out and out loser up until now, somehow found some 28x more results than Google and Yahoo! combined. This seems outrageous on multiple fronts and likely points to a flaw in Bing’s URL counting system. (Note: Neither Yahoo! nor Google has an extremely reliable system, but I was able to hand check their results up to this point.)
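The 28x figure is easy to verify from the reported estimates in the table above. This is only a quick arithmetic check, using the engines' own self-reported numbers.

```python
# Self-reported result estimates from the Round 5 table.
bing, google, yahoo = 4_800_000, 53_700, 118_000

combined = google + yahoo   # Google and Yahoo! together
ratio = bing / combined     # how many times larger Bing's estimate is

print(combined)         # 171700
print(round(ratio))     # 28
```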

As such, I am tossing Bing’s results for this round. Of the two that are within the realm of possibility, Yahoo! has a definite edge with more than double the results. Still, I don’t put too much stock in this test due to the unreliable nature of search engine self-reporting.

Round Winner: Yahoo! found more results and takes the test, though this really is not a definitive test, even less of one than the others.

Final Results

Of the four rounds that had winners, Yahoo! and Google split them 2-to-2. However, Yahoo!’s victory in the last round is both less compelling and of less use to most copyright holders than the other rounds so I have to declare Google, by a hair, the overall winner in this deathmatch.

But while this probably will not settle the debate about Yahoo! vs. Google, it does indicate one thing very clearly: for the purpose of plagiarism detection, Bing is a washout. Not only did Bing place or tie for last in every test (save the dubious results in the fifth one), it failed to detect any copies of the content in one of them.

To make matters even worse, Bing also makes it very difficult to filter out duplicate entries. Where Yahoo! and Google both attempt (somewhat unsuccessfully) to group likely duplicates together, Bing, it seems, does not. In many cases duplicate results were spread across multiple pages rather than being indented under the original or clustered together.

Whether or not you make Bing your search engine of choice is your decision, but it probably should not be your search engine of choice for finding copies of your content.

Bottom Line

I’ve said it before and I will say it again. Don’t rely on any one search engine for your plagiarism detection. Both Yahoo! and Google found results the other missed. It is that simple.

However, if we’ve learned anything from this test it is that Bing, at least right now, is not ready to be relied on for plagiarism/copy detection. Whether one thinks its regular search results are solid or not, its phrase search results, the ones used to detect this kind of copying, are very weak.

This may be a bit unfair in that both Yahoo! and Google are established search players while Bing is just a preview release. However, since Live.com search results are already being forwarded on and users of IE6 are having it forced on them as their default search engine, it seems fair enough to put it to the test.

Right now, Bing is not up to this challenge, though my hope is that, as Microsoft works on the search engine and improves it, it may grow to become a viable third competitor.