Scraping: Not Just for RSS Feeds Anymore
Back in 2005, I wrote an article entitled “Truncated Feeds: A Temporary Solution”. It was about a then-popular trend of truncating or shortening RSS feeds to discourage scraping.
The reason for feed truncation is simple: most RSS scraping takes place through the RSS feed, so truncating it, or showing only the first few paragraphs, meant scrapers couldn’t grab the whole post and could do only limited damage to your site and its content.
However, as I pointed out in my article, the technology was already available (and already a decade or more old) to scrape content out of the Web page itself. This meant that truncating an RSS feed, while useful against most scrapers (at the risk of angering readers), was a temporary solution until spammers started using more advanced methods.
Now, it appears I might have been ahead of the curve. At least two clients, and several others I’ve talked with, have reported that, despite having either no RSS feed or only a truncated one, their sites’ full content is being scraped. The problem seems to be growing, and it seems likely that it will get a lot worse before it gets any better.
Why RSS Scraping Was (and Is) King
RSS scraping became popular because of how simple it is. RSS feeds, unlike regular HTML, have a predictable format and structure that makes it easy to extract their content. It’s how RSS readers work, and RSS scrapers exploit the same simplicity.
RSS feeds are plentiful, and their ease of use also opens the door to other kinds of manipulation, such as swapping out words, inserting links and clipping off unwanted portions.
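To illustrate just how low the bar is, here is a minimal sketch, using nothing but Python’s standard library, of the kind of extraction an RSS scraper performs. The feed URL is a placeholder:

```python
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example.com/feed"  # hypothetical feed URL

# Fetch the feed; RSS is just XML with a predictable structure.
with urllib.request.urlopen(FEED_URL) as response:
    root = ET.fromstring(response.read())

# RSS 2.0 nests every post in <channel><item> elements, so one loop
# is enough to pull each post's title, link and body out of the feed.
for item in root.iter("item"):
    title = item.findtext("title", default="")
    link = item.findtext("link", default="")
    body = item.findtext("description", default="")
    print(title, link, f"({len(body)} characters of content)")
```

A dozen lines, no special tooling, and the same loop works on virtually any feed on the Web. That predictability is the whole appeal.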
However, RSS feeds also have serious problems from the spammer’s perspective. First, webmasters, once they learn about the scraping, often truncate the feed or insert warnings into it to make it useless to the spammer. Second, webmasters often alert Google and other search engines via RSS when a new post goes live, making it difficult for the spammer to reach the search engines first.
Combine that with recent Google updates, including Panda (also known as the Farmer update), which have been pounding sites seen as content farms or spam, and it’s easy to see why many spammers have been seeking alternate ways to get their content in recent months.
Non-RSS Scraping Comes to Town
Non-RSS scraping gained prominence in January of last year, when Google announced that Google Reader could track changes on any site. The feature wasn’t evil in and of itself, displaying only truncated content, but it brought the issue to the forefront and proved that the technology was practical, scalable and functional.
Even though it was shut off just nine months later, the proof of concept was still there and it seems at least some spammers took notice.
However, Google wasn’t the first, nor was it the last; it was merely the biggest. Yahoo! Pipes and Page2RSS have done the same thing for years (note: once again, I’m not saying these are the services the spammers are using, merely that they are proofs of concept), and there are countless downloadable applications that can run on a server. The latter is, most likely, the approach being taken by spammers.
The technology is clearly there and has been for some time, but now it has the attention of at least some spammers, and that, in turn, is going to change the game for content creators sooner rather than later.
How Webmasters Can Fight Back
These tools work because (a) it’s trivial to detect changes on a website and (b) even though sites differ from one another, the pages within a single site tend to remain fairly consistent.
This is a big part of how sites are built today, as most are template-driven. To make matters worse, with CMSes like WordPress, Drupal, Joomla, etc., there’s often a lot of consistency between sites that use the same platform, making scraping even easier.
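To make that concrete, here is a minimal sketch of both halves, assuming the requests and BeautifulSoup libraries. The URL is a placeholder, and the .entry-content selector is an assumption, a post-body class used by many default WordPress themes:

```python
import hashlib

import requests
from bs4 import BeautifulSoup

PAGE_URL = "https://example.com/some-post/"  # hypothetical target page

def fetch_post_body(url):
    """Pull the post body out of the page using a template selector."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # .entry-content is a post-body class shared by many WordPress themes
    body = soup.select_one(".entry-content")
    return body.get_text(strip=True) if body else ""

# (a) Change detection is trivial: hash the extracted text and compare
# it with the hash saved on the previous visit.
previous_hash = None  # in a real scraper, loaded from storage between runs
text = fetch_post_body(PAGE_URL)
current_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()

if current_hash != previous_hash:
    print("Page changed; scraped content starts:", text[:80])
```

Because one selector covers thousands of sites running the same platform, a scraper written once against a common theme works nearly everywhere.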
This makes fighting back against this kind of scraping much more difficult. However, the usual tips for fighting against scraping remain relevant, just no longer solely for the RSS feed:
- Link to Your Content Regularly: Try to include one or two links in each of your posts that point back to other pages or posts on your site. This helps tell the search engines which site is the authentic one and passes along credibility accordingly.
- Include a Footer: Including a footer in the content area of your site may cause it to get picked up along with your text, especially by highly automated scrapers. You can also include a digital fingerprint in the content area, though when searching for it later you’ll have to exclude your own site (see the sketch after this list).
- Break Apart Content: Breaking content apart across multiple pages is a controversial strategy (one that your readers will likely not care for), but it can also hinder this style of scraping, much as truncated RSS feeds do.
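On the fingerprint idea, here is a hedged sketch: generate a unique token per post, embed it in the content area, and later search for it while excluding your own domain. The domain and token format here are made up for illustration:

```python
import uuid

MY_DOMAIN = "example.com"  # stand-in for your own site

def make_fingerprint():
    """A short token unlikely to occur naturally in anyone else's text."""
    return "fp-" + uuid.uuid4().hex[:12]

# Embed the token in the post's content area, e.g. inside the footer,
# so automated scrapers copy it along with everything else.
token = make_fingerprint()
footer = f'<p class="post-footer">Originally published at {MY_DOMAIN}. {token}</p>'

# Later, search for the token while omitting your own site; any hit
# is a page that copied your content area wholesale.
query = f'"{token}" -site:{MY_DOMAIN}'
print(footer)
print("Search for:", query)
```

Store each token alongside its post so a later hit tells you exactly which article was copied.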
All in all, this type of scraping is going to be much more difficult to combat, as there is no feed you can simply disable, alter or truncate if things get out of hand. It’s a problem that will have to be met head-on and one webmasters will have to be vigilant about.
Bottom Line
The good news in all of this is that this type of scraping is still not very common, at least not that we know of. However, if you have a full RSS feed, it may be difficult to tell whether scrapers are pulling your content from the feed or from the page itself.
Still, this type of scraping does appear to be on the rise. The only question is whether it will be a long-term trend or a short-term fad. Right now, the former is looking more likely.
This is why webmasters need to be aware of this problem, so they can watch for it and, when needed, deal with it.
In the end though, the only real question I have is why it took spammers so long to pick this up. Usually on the cutting edge, they seem to be at least five years behind the curve on this one, if not more.
Maybe someone can provide an explanation for that one below…