The Slow Rise of Whole-Site Scraping

Jonathan Bailey, September 5, 2013

The nature of spam on the Web is constantly evolving, usually in response to the shifts that Google makes in its algorithms that are designed to address spam and other forms of undesirable content.

For a time, it appeared that scraping, in particular RSS scraping, was on the wane. Google had been tackling duplicate content and content farm sites, in particular with its “Panda” updates that began in 2011, and that seriously hurt the prospects of most scrapers.

As a result, spammers had to turn their attention to other methods for getting their content to the top of Google.

For much of the past few years, the focus has been on linking and link-building. However, Google has recently started to clamp down on questionable linking practices as well, issuing manual and automatic penalties for sites that have what it deems unnatural inbound links.

This has been something of a one-two punch for many webmasters. On one hand, Google’s penalties have gone well beyond sites that were intentionally spamming, even hitting some sites that had links placed without their permission. On the other, it has sent spammers back to the drawing board on spam techniques, once again looking to prey on legitimate sites.

Some, it seems, have settled on scraping and content duplication as a means of filling their pages. However, it’s not just RSS scraping that’s becoming common. Recently, I’ve been seeing a rise in whole site scraping, in which a site's HTML, images and other content are spidered, copied and reposted on another server.

So what can you do to ensure your site isn’t impacted? To answer that we first have to look at the nature of the problem and then see what can be done about it.

The New Scraping

Historically, scraping has been primarily done through the use of RSS feeds. RSS feeds put content in a simple format that made it easy to scrape. Couple that with misguided legal theories saying that RSS scraping is acceptable, and it became a popular target for spammers.

The content from RSS feeds would then be ported over, usually wholesale, into crudely built blogs that were designed to fool the search engines into thinking they were the original source of the content.

However, as Google became more savvy at spotting the fake blogs and began to clamp down on sites with large amounts of duplicate content, the method became less and less effective. This caused spammers to largely abandon the tactic (though some continued using it and still do), pushing RSS scraping to the back burner of issues webmasters had to worry about. Now, however, the broader issue of scraping seems to be coming up again, though this time the RSS feed isn’t necessarily involved.

This was something I talked about initially in 2005 and again in 2011. This time, however, the spammers involved aren’t interested in just the content. Increasingly, they’re interested in the entire site, porting it over to a new domain and often updating it as the original is changed.

Though I only have anecdotal evidence to date, over half of all of the scraping cases I’ve been seeing in recent months have involved not just RSS scraping, but whole site or near-whole site scraping. What’s more disconcerting is that I’m hearing more and more about scraping in general, indicating that, most likely, the practice is on the rise to at least some degree.

Defending Against Whole Site Scraping

Defending against whole site scraping is much more difficult than RSS scraping. Where, previously, you could truncate your RSS feed or add information in the footer of the feed, those tricks don’t work when a spammer is pulling content directly from your site.

However, there are things that you can and should consider doing to mitigate or eliminate the problem. Those include:

Use Google Webmaster Tools: Join and use Google Webmaster Tools. It can alert you to duplicate content issues and any unusual inbound links.
Link to Yourself Routinely: Be sure to link to other articles on your server so the scraper site, unless it removes links or alters them, will point to you. This helps with potential SEO issues and lets you track these scrapers with your site stats and Google Webmaster Tools.
Track Your Content: Use the standard methods to track your content online. Consider using services such as BlogAvenger (previous coverage) to help track your work more completely.
Consider Using Anti-Scraping Technology: If the problem is a serious concern, consider using a service such as Distil (previous coverage) or anti-scraping plugins to stop the scrapers from accessing your site.
Booby Trap Your Code: Finally, you can also add traps into your code using JavaScript to check and make sure that the code is being loaded on the correct domain and then take some action against the site. I’ll have more on this in a later post.
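
As a rough illustration of the domain check described in the last item, here is a minimal sketch in TypeScript. It is not a finished defense, and a scraper that strips or rewrites scripts will defeat it. The hostname example.com and the names isAllowedHost and scrapedCopyNotice are hypothetical placeholders chosen for this example. The "action" taken here is only to produce a notice pointing back to the original page, which is one of many possible responses.

```typescript
// Hypothetical hostname of the original site; replace with your own.
const PRIMARY_HOST = "example.com";
const ALLOWED_HOSTS: string[] = [PRIMARY_HOST];

// Minimal shape of the browser's `location` object that this sketch needs.
interface PageLocation {
  hostname: string;
  pathname: string;
}

/** Returns true when `hostname` is an allowed host or a subdomain of one. */
function isAllowedHost(hostname: string, allowed: string[] = ALLOWED_HOSTS): boolean {
  // Normalize case and drop a trailing dot ("example.com." is a valid form).
  const host = hostname.toLowerCase().replace(/\.$/, "");
  // Require an exact match or a "." boundary so "evil-example.com" fails.
  return allowed.some((base) => host === base || host.endsWith("." + base));
}

/**
 * Returns a notice pointing at the original page when the script is running
 * on an unrecognized domain, or null when the domain is allowed.
 */
function scrapedCopyNotice(loc: PageLocation): string | null {
  if (isAllowedHost(loc.hostname)) {
    return null;
  }
  return "This page appears to be a copy. Original: https://" + PRIMARY_HOST + loc.pathname;
}

// In a browser, the check could be wired up like this (assumed usage):
//   const notice = scrapedCopyNotice(window.location);
//   if (notice !== null) { console.warn(notice); }
```

The comparison is made against the full hostname with a dot boundary, rather than a simple substring search, so that look-alike domains such as example.com.scraper.net are not mistaken for the original.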

All in all, there are steps that you can take to prevent or reduce this type of scraping but, as with any change in spammer tactics, it’s important that your tactics change as well.

Bottom Line

Spamming has always been something of a numbers game. This is just as true of web spam as it is of email spam. If Google is able to successfully spot and demote 99.9% of spam blogs, then a spammer knows they have to make an average of 1,000 spam blogs to have one success, and that is something they can easily do.
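
To make that arithmetic concrete, here is a small sketch in TypeScript. It assumes, purely for illustration, that every spam blog has the same independent 0.1% chance of slipping past Google. The function names are hypothetical and the 99.9% figure is the illustrative number used above, not a measured rate.

```typescript
/** Average number of blogs a spammer must create per blog that survives. */
function expectedBlogsPerSuccess(survivalRate: number): number {
  return 1 / survivalRate;
}

/** Probability that at least one of `count` blogs survives (assumes independence). */
function chanceOfAtLeastOne(survivalRate: number, count: number): number {
  return 1 - Math.pow(1 - survivalRate, count);
}

// 99.9% of spam blogs demoted means 0.1% slip through.
const survivalRate = 0.001;

console.log(expectedBlogsPerSuccess(survivalRate)); // about 1000
console.log(chanceOfAtLeastOne(survivalRate, 1000)); // roughly 0.63
```

Under these assumptions, 1,000 blogs is the average cost of one success rather than a guarantee: a batch of 1,000 still comes up empty a little over a third of the time, which is why the volume has to be cheap to produce.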

However, as Google has shifted and altered its approach to fill its weaknesses, spammers have had to scramble to find new approaches and, lately, one of those approaches has been whole site scraping.

Whether or not this trend will last or catch on remains to be seen, but in the meantime, it’s important to watch out for it, especially if your site is already vulnerable in Google due to other penalties (such as the aforementioned linking penalties).

Though your site may not be one of the ones bitten, there’s no way to know for certain unless you’re aware of the problem and looking for it.