Recently, Morris Rosenthal brought to my attention a Russian site entitled AnalyzeThis.ru. It’s a site that analyzes the results of popular search engines from around the world and looks at various metrics, including how well they filter out adult content, how much spam they contain and so on.
One of the factors they look at is how well the search engines rank original texts over duplicates. This, of course, is of great interest to anyone who posts original content online as, inevitably, it will be copied (legitimately and otherwise) and it’s important that the search engines do a reasonable job favoring the original author.
However, the results from AnalyzeThis.ru are, simply put, less than inspiring. According to the site’s most recent survey, Google gets it right about 57% of the time. Even scarier, this is after a year of drastic improvement, up from under 10% in June of last year. Even more discouraging is that Google is easily the best, way ahead of Bing, which is currently hovering at 7%.
To be clear, I take these numbers with a good deal of skepticism. The site doesn’t clearly explain how it obtains its data but, even if the tests aren’t perfect, they highlight the point that Google isn’t either. Even after its updates, Google still barely managed to get half of the cases right in these tests. Even if we assume that 90% of the errors this test reported were false positives, that still leaves about 4.3% (just under 5%) of all works being mislabeled by Google as duplicates.
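The back-of-the-envelope figure above can be checked with a short sketch. It assumes the reported 57% success rate and reads "90% false positives" as meaning that nine out of ten of the errors the test counted were the test's own mistakes rather than Google's. The function name is made up for illustration and is not part of any tool mentioned here.

```python
def true_error_rate(reported_correct_rate, test_false_positive_share):
    """Share of all works genuinely misranked once the test's own errors are discounted.

    Both arguments are fractions between 0 and 1. This is an illustrative
    helper built on the assumptions stated above, not the site's methodology.
    """
    reported_error_rate = 1.0 - reported_correct_rate
    return reported_error_rate * (1.0 - test_false_positive_share)


# 57% reported correct, 90% of the reported errors assumed to be false positives
rate = true_error_rate(0.57, 0.90)
print(f"{rate:.1%}")  # prints 4.3%
```

Changing the second argument shows how sensitive the conclusion is to the assumed quality of the test: at 50% false positives the same calculation gives over 20%.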
It’s easy to see how, with so many content creators and so many different works, this can become a very big problem very quickly.
Google’s Problem
For Google (and other search engines) the problem of duplicate content is a thorny one. When a user enters a search query, they don’t want to see ten pages with the same content; they usually want a variety of links to choose from. So, if a lot of pages on the Web have the same content, Google has to decide which one is the original and give it preferential treatment.
However, finding the original is easier said than done.
Google uses a variety of factors to determine which is the original such as the trust it has with the domain, the age of the site/page and the number of inbound links. However, these metrics aren’t perfect and, inevitably, Google gets it wrong from time to time.
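To illustrate why signals like these can point at the wrong page, here is a toy sketch. Everything in it is an assumption made for illustration: the `Page` fields, the weights, the URLs and the scoring formula are hypothetical and say nothing about how Google actually combines its signals.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    domain_trust: float  # hypothetical trust score between 0.0 and 1.0
    age_days: int        # how long the page has been online
    inbound_links: int   # number of links pointing at the page


def originality_score(page, max_age_days, max_links):
    """Combine the three signals with made-up weights (0.4, 0.3, 0.3)."""
    age_part = page.age_days / max_age_days if max_age_days else 0.0
    link_part = page.inbound_links / max_links if max_links else 0.0
    return 0.4 * page.domain_trust + 0.3 * age_part + 0.3 * link_part


def pick_original(pages):
    """Return the page this toy heuristic believes is the original."""
    max_age = max(p.age_days for p in pages)
    max_links = max(p.inbound_links for p in pages)
    return max(pages, key=lambda p: originality_score(p, max_age, max_links))


# The true original: an older post on a small, little-linked blog.
original = Page("https://small-blog.example/post", 0.3, 700, 5)
# A recent copy on a well-linked, highly trusted site.
copy = Page("https://big-aggregator.example/copy", 0.9, 30, 400)

winner = pick_original([original, copy])
print(winner.url)  # prints https://big-aggregator.example/copy
```

In this made-up case the copy wins even though the original is far older, because trust and inbound links outweigh age. Any fixed weighting of such signals will have cases like this one.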
Spammers, in turn, count on this. They often copy large amounts of content hoping that Google will believe at least some of it is original. Sadly, it’s a numbers game and the spammers, despite improvements, are still winning.
If these numbers are remotely accurate, it’s easy to see how, while Google has definitely made things harder on the spammers, they can still find great success copying content.
This further emphasizes how important it is for webmasters and content creators to be vigilant with their content and not put their faith in Google to keep them ahead of the bad guys.
Bottom Line
Just to be clear, I’m still very skeptical about these exact numbers. Without knowing more about the methodology, it’s impossible to be sure that they are accurate.
However, even with imperfect results, they highlight just how flawed Google is, even after the recent (and significant) improvements.
This makes it clear that, despite what Matt Cutts said, the scenario of a duplicate outranking an original work is not “highly unlikely” and seems to happen fairly regularly.
Obviously, this is a tough problem for search engines to crack and, until they do, webmasters need to be vigilant with their content.


I still find that most of the time the original article ranks best but occasionally newer content wins.
My personal theory: Topics deemed ‘Newsworthy’ or ‘Information that changes regularly’ are more time sensitive, whilst content on more static topics isn’t treated as time sensitive.
i.e. Celebrity Weddings vs History of the Roman Empire.
@explainafide What about celebrity weddings that took place in the time of the Roman Empire?
@Adam Senour That would cause Google’s Algorithms to crumble faster than the Colosseum.
In fact, with all their polygamy, illegitimate children, adultery and beheadings, maybe Google could use the Roman Empire to test out how well it understands a content trail.
I run a website that has had a terrible time with the Google algorithm updates of the last quarter of 2011. One of the things that has happened is that a couple of times I found my original content is now returned in Google search results way after scraped content. I believe this is a result of Google’s stated increased focus on the date of web pages. Previously if an article I had written 2 years ago was in search results it would be high up, way before any sites that had stolen or copied the content more recently. Now a scraper site with my content sometimes shows up before my original content because the scraped site has a very recent date and my content is a couple of years old. Google used to be very good at burying content stolen from me deep down in search results. So good in fact that I stopped even worrying about all the sites that steal my content. Now, for some reason (I think maybe because of their increased focus on “fresh” content), Google seems not as good at identifying and weeding out the stolen content. It seems that recently dated material has become very important to Google, sometimes at the expense of more valuable, original content with a long comment thread, but an old date.
Now that you mention it.
I redesigned a professional services website recently and there were about 30 posts in total. I reorganised the dates more due to aesthetics than anything else and most of the articles that were re-dated to a newer date than the original ranked better after a few weeks. I didn’t think much about it though.
That’s interesting. I usually see the opposite, where older content is treated with preference but I wonder if certain competitive genres are having such a heavy emphasis placed on newness that new content can easily trump old. Definitely worth looking into some more!
Another great post Jonathan.
I think Google is certainly improving but as discussed it has a long way to go.
I just completed some back link research on a niche within the writing industry and found the #1 spot for this particular niche was a writer that has set up 40 sham content farm websites, all linking to their main website. This person has been #1 for years, but is now beginning to lose the absolute domination of the niche, so it would appear Google is catching up with the techniques employed.
@explainafide Agreed that Google is doing a lot better. The big problem is that the spammers and other bad guys will improve their techniques and the race will begin all over again. It’s not a sprint for Google, but a marathon and, while the progress is good, one thing the chart does show is that every improvement is followed by a period of decreasing effectiveness.
Here’s hoping that doesn’t happen this time!
@myersnews I haven’t dug into their methodology. Bing getting it right 7% of time? Sounds too low. But we have improved a lot recently.
@mattcutts Thanks, Matt.
I don’t have serious doubts about this tool at all. To have those doubts would suggest the possibility that such a tool is useful and accurate, and that’s not the case here.
First of all, the creators are in the SEO field. They don’t make any effort to hide it. It’s on the footer of every page of the site. That alone should cast doubt on the credibility of the tool…chances are at least some of the terms are client terms that they’re checking up on.
Second, any SEO worth his/her salt should know better than to run an automated rank checker (with 500 unique queries per day + client work, it’s a fairly safe bet that they’re not running the queries manually). For that matter, any SEO worth his/her salt should understand why this process is at best a useless exercise. It’s also a clear violation of search engine terms of use (not that I’m defending search engines in this regard, but this is one of those SEO 101 things someone in the industry should know.)
Third, Google has been known to present very different results based on several factors. Has this tool compensated for this? In all likelihood, no…such a task would be far too complex to be worthwhile, especially since these guys aren’t charging for their “service.”
This is nothing more than a junk science tool created to present data based on mysterious evaluations in an effort to try to curry favor with either the disenfranchised among the Internet marketing crowd or the crowd who is interested in something new and different and shiny and full of big impressive-sounding statistics.
Does that mean I think Google gets it right all the time? No. Even they’d say that. Does that mean I believe the “57%” figure, either? No, since it “started” at under 10% and the issue of copyright was never THAT bad (I don’t think Bing is that bad that way, either). The issue exists, and Google among others will acknowledge as much…but a lot of it becomes a form of faith poison, where you believe you are affected by the illness since it’s easier to blame someone else than it is to take action yourself.
@Adam Senour I agree completely (as I mentioned in the article) that I’m highly suspicious of the data for many reasons, including some of the ones you mention. I also want to know more about the methodology (how they determine what is original, etc.).
Still, this is the only such test of its type I’ve seen done. Highly flawed, but a starting point for discussion and I do believe some of its conclusions, including that Google has gotten a lot better over the past year. However, I also agree that 57% is way too low. When I’m asked I usually say around 5%, which is why I hinted at the 90% error rate.
That being said, on some types of content, it may be well below 50%, there are a lot of variables at play.
Personally, I’m treating this a lot like Compete or Alexa statistics. Deeply flawed and not accurate in the least but the best we have.
@plagiarismtoday Nice comparison. The Alexa thing in particular made me laugh (not because it was a bad comparison, but just because it IS deeply flawed.)
@Adam Senour When I need a laugh, I check out Alexa’s stats for this site. Apparently I should be picking berries or something by now…
@plagiarismtoday If you want a real good laugh, check out SEMRush.com sometime. According to it, my one site receives 604 visitors per month from Google.ca. Digits are missing from that figure.
Google really has gotten better since last year. I was having a huge problem with it after their re-alignment of their system in early 2011. Sites, like blogs, that had a feed of RSS headlines from sites they followed were listing search results for my unique content higher than me. I reported it multiple times but really couldn’t get them to pay attention to me. Since about 2/3 of my site’s traffic is from searches, it really brought my visitor tally down – by about 20% for most of 2011.
Here’s what it looked like during that period – http://www.flickr.com/photos/typetive/5515516473/
It’s much better now. Re-running that same query, or a similar one for a rather unique phrase gets my content on top of any repeats. (Many of those excerpted bits are no longer even showing up.)
Glad to hear that they have improved for you. It’s been a real double-edged sword. Those who were helped by the changes were really helped, which is most it seems, but those who got mistaken for duplicate content got bit far, far worse than normal.
Overall, the changes have been good but I have a few who disagree…