How Accurate is Google with Detecting Duplicates?
Recently, Morris Rosenthal brought to my attention a Russian site called AnalyzeThis.ru. It analyzes the results of popular search engines from around the world, looking at metrics such as how well they filter out adult content, how much spam they contain and so on.
One of the factors it looks at is how good the search engines are at ranking original texts above duplicates. This, of course, is of great interest to anyone who posts original content online, as it will inevitably be copied (legitimately and otherwise), and it's important that search engines do a reasonable job of favoring the original author.
However, the results from AnalyzeThis.ru are, simply put, less than inspiring. According to the site's most recent survey, Google gets it right about 57% of the time. Even scarier, that figure comes after a year of drastic improvement, up from under 10% in June of last year. More discouraging still, Google is easily the best of the bunch, way ahead of Bing, which is currently hovering around 7%.
To be clear, I take these numbers with a good deal of skepticism. The site doesn't clearly explain how it obtains its data but, even if the tests aren't perfect, they highlight the point that Google isn't either. Even after its updates, in these tests, Google still barely got half of the cases right. Even if we assume that 90% of the reported failures were false positives of the test itself, that still leaves just under 5% of all works being mislabeled by Google as duplicates.
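For those who want to see that arithmetic spelled out, here is a quick back-of-the-envelope sketch in Python. The 57% figure comes from the survey and the 90% discount is the assumption above; nothing here is a measurement of my own:

```python
# Back-of-the-envelope check on the numbers above (hypothetical, taking the
# survey's 57% figure at face value).
correct_rate = 0.57            # share of cases where Google favors the original
miss_rate = 1 - correct_rate   # ~43% of cases where a duplicate outranks the original

# Assume 90% of those reported misses were false positives of the test itself.
test_false_positive_share = 0.90
remaining_misattributions = miss_rate * (1 - test_false_positive_share)

print(f"Misses reported by the test: {miss_rate:.0%}")
print(f"Left over after discounting 90% as test error: {remaining_misattributions:.1%}")
# -> roughly 4.3%, i.e. "just under 5%" of works mislabeled as duplicates
```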
It’s easy to see how, with so many content creators and so many different works, this can become a very big problem very quickly.
Google’s Problem
For Google (and other search engines), the problem of duplicate content is a thorny one. When a user enters a search query, they don't want to see ten pages with the same content; they usually want a variety of links to choose from. So, if many pages on the Web carry the same content, Google has to decide which one is the original and give it preferential treatment.
However, finding the original is easier said than done.
Google uses a variety of factors to determine which copy is the original, such as the trust it has in the domain, the age of the site or page and the number of inbound links. However, these metrics aren't perfect and, inevitably, Google gets it wrong from time to time.
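To illustrate why signal-based guessing can backfire, here is a toy sketch of a signal-weighted scorer. This is not Google's actual algorithm; the Page class, the weights and the example URLs are entirely made up, and real engines use many more signals (crawl dates, canonical tags, sitemaps and so on):

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    domain_trust: float   # 0-1, how much trust the engine places in the domain
    page_age_days: int    # how long the page has been known to the engine
    inbound_links: int    # count of links pointing at this page

def originality_score(page: Page) -> float:
    """Toy heuristic: combine a few signals with made-up weights."""
    age_signal = min(page.page_age_days / 365, 1.0)   # cap at one year
    link_signal = min(page.inbound_links / 100, 1.0)  # cap at 100 links
    return 0.5 * page.domain_trust + 0.3 * age_signal + 0.2 * link_signal

duplicates = [
    Page("https://original-author.example/post", domain_trust=0.4,
         page_age_days=400, inbound_links=12),
    Page("https://big-aggregator.example/copy", domain_trust=0.9,
         page_age_days=30, inbound_links=80),
]

# Whichever copy scores highest gets treated as "the original" -- and here the
# newer but better-trusted, better-linked copy wins, showing how the heuristic
# can favor a duplicate over the true source.
best = max(duplicates, key=originality_score)
print(best.url)
```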
Spammers, in turn, count on this. They often copy large amounts of content, hoping that Google will treat at least some of it as original. Sadly, it's a numbers game and the spammers, despite Google's improvements, are still winning.
If these numbers are remotely accurate, it's easy to see how, even though Google has definitely made things harder for spammers, they can still find great success copying content.
This further emphasizes how important it is for webmasters and content creators to be vigilant with their content and not put their faith in Google to keep them ahead of the bad guys.
Bottom Line
Just to be clear, I’m still very skeptical about these exact numbers. Without knowing more about the methodology, it’s impossible to be sure that they are accurate.
However, even with imperfect results, they highlight just how flawed Google's duplicate detection still is, even after the recent (and significant) improvements.
This makes it clear that, despite what Matt Cutts said, the scenario of a duplicate outranking an original work is not “highly unlikely” and seems to happen fairly regularly.
Obviously, this is a tough problem for search engines to crack and, until they do, webmasters need to be vigilant with their content.