FAQs: The Basics of RSS Scraping

RSS scraping is one of the most common and most frustrating types of content theft that bloggers, forum admins and other site owners face as they grow their presence online. It often lets the scraper grab all of the content from the original site with ease, and it is a favorite tactic of spammers, who exploit the content for search engine gains and are among the most despised infringers online.

As such, it’s important for all webmasters and content creators to be aware of what RSS scraping is, how it works and where it’s going in the future. Even though RSS as a format may be on the ropes, RSS scraping is not a problem that’s going away and, in fact, may get a lot worse in the coming years.

With that in mind, here is a quick FAQ on some of the more common questions asked about RSS scraping and what can be done about it.

What is RSS?

RSS, sometimes referred to as Really Simple Syndication or Rich Site Summary, is an XML-based format that makes it easy for other sites and tools to access the content on your site by presenting that content in a consistent, easy-to-parse way.

Unlike an HTML document, where the content could be anywhere on the page, an RSS feed clearly marks the headline, body and other elements of each piece of content. This makes it easy to grab the content and display it elsewhere without the surrounding formatting and HTML code.
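To make that concrete, here is a minimal sketch in Python of how a reader (or a scraper) can pull the headline, link and body out of a feed. The feed itself is made up for illustration, and the function name `extract_items` is hypothetical; real feeds usually carry more elements than this.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 2.0 feed used only for illustration.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>https://example.com/</link>
    <description>A hypothetical site</description>
    <item>
      <title>First Post</title>
      <link>https://example.com/first-post</link>
      <description>The full text of the first post.</description>
    </item>
    <item>
      <title>Second Post</title>
      <link>https://example.com/second-post</link>
      <description>The full text of the second post.</description>
    </item>
  </channel>
</rss>"""


def extract_items(feed_xml):
    """Return one dict per <item> with its headline, link and body."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "body": item.findtext("description", default=""),
        })
    return items


for entry in extract_items(SAMPLE_FEED):
    print(entry["title"], "->", entry["link"])
```

Because every item labels its own parts, no guesswork about page layout is needed, which is exactly what makes feeds convenient for readers and scrapers alike.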

How is RSS Normally Used?

Traditionally, RSS has been used to let readers subscribe to a site through RSS readers such as Google Reader and FeedDemon, and even through many mail clients.

However, RSS has also been used to power other services, such as email newsletters and even Facebook integration.

What is RSS Scraping?

RSS scraping is when a third party, usually a spammer, grabs the content in an RSS feed and republishes it wholesale on another site.

In this regard, RSS scrapers work a great deal like Google Reader, grabbing your site’s content and displaying it elsewhere. The difference is where the content ends up. Google Reader places it behind a password-protected wall that only the subscriber (or someone the subscriber has shared an individual story with) can access. Scrapers instead place the content on a public site for anyone to view, including search engines.

Why do People Scrape RSS Feeds?

Spammers seek high rankings in search engines so they can attract traffic, which they then use to display ads or sell products. To do this, they need content, but creating content by hand is time-consuming and difficult, especially when much of it will make no difference in the search engines.

RSS scraping is an easy way for spammers, and other sites, to quickly fill their pages with content, even if the content comes solely from other sites.

How Can RSS Scraping Hurt Me?

In most cases, RSS scraping doesn’t hurt. Google and other search engines have become savvy enough about spam that most of the time, they don’t give much credence to spam sites, keeping them from getting a lot of traffic or harming you in the rankings.

However, the system is far from perfect and there are many times when spammers outrank the sites they scrape from for relevant terms. This is especially true with new sites or those that don’t have a strong search engine presence.

Less likely is that others may mistake the spam site for the original site or for one endorsed by you, which would actively take traffic from you. Few people, however, make this mistake with spam sites as the distinction is usually very clear.

All in all, the risk from an individual case of RSS scraping is actually fairly low, but the problem is that there is rarely just one or two such scrapers working at any given time.

What Can I Do About RSS Scraping?

Dealing with RSS scraping starts with good SEO practices. If you link between your posts, get good inbound mentions and earn social networking shares, odds are that RSS scraping won’t greatly impact you.

If it does, you can always seek to have the content removed by filing a DMCA notice with the spammer’s host or, if that fails, sending one to Google.

If RSS scraping becomes a more serious and more recurring problem, you may want to consider truncating your feeds or eliminating them, though that would be an extreme last resort.
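As a rough illustration of what truncating a feed means in practice, the sketch below cuts a post down to a short excerpt before it goes into the feed. It is written in Python under a few assumptions: the function name `truncate_for_feed` and its parameters are hypothetical, the tag stripping is deliberately crude, and most blogging platforms offer a built-in setting that does this for you.

```python
import re


def truncate_for_feed(body, max_words=50, link=None):
    """Return the first max_words words of body as a plain-text excerpt."""
    # Crude tag stripping for illustration; a real feed generator
    # would use a proper HTML parser.
    text = re.sub(r"<[^>]+>", " ", body)
    words = text.split()
    if len(words) <= max_words:
        # Short posts are passed through whole.
        return " ".join(words)
    excerpt = " ".join(words[:max_words]) + "..."
    if link:
        # Point readers (and anyone reading a scraped copy) back to the source.
        excerpt += " Read more: " + link
    return excerpt


print(truncate_for_feed(
    "<p>one two three four five</p>",
    max_words=3,
    link="https://example.com/post",
))
```

A scraper that copies a truncated feed only gets the excerpt, and the link back to the original travels with it.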

Is RSS Scraping Illegal?

Some have made arguments that distributing your content via an RSS feed, even if you didn’t realize you were doing it, creates an implied license to use it in this manner. However, there are many problems with that and other related arguments on RSS scraping.

Generally, RSS scraping is considered to be copyright infringement, though there are other legal arguments against RSS scraping as well.

What if I Want to Encourage RSS Scraping and Reuse?

If you want others to scrape your RSS feed, you can give blanket permission to do so by inserting a Creative Commons license into your feed. This lets scraping bots know your intentions, and those that comply with the law should be able to follow your wishes.
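One way to do this, sketched below in Python, is the creativeCommons RSS module, which adds a license element to the feed's channel. Treat this as an assumption-laden example rather than the required method: the helper name `add_license` is hypothetical, the license URL is just one possible choice, and many feed generators and blogging platforms have their own setting or plugin for this.

```python
import xml.etree.ElementTree as ET

# Namespace of the creativeCommons RSS module.
CC_NS = "http://backend.userland.com/creativeCommonsRssModule"
ET.register_namespace("creativeCommons", CC_NS)


def add_license(feed_xml, license_url):
    """Return feed_xml with a channel-level <creativeCommons:license> element."""
    root = ET.fromstring(feed_xml)
    channel = root.find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 feed: no <channel> element")
    tag = "{%s}license" % CC_NS
    # Only add the element if the channel does not already declare a license.
    if channel.find(tag) is None:
        ET.SubElement(channel, tag).text = license_url
    return ET.tostring(root, encoding="unicode")


feed = '<rss version="2.0"><channel><title>Example Blog</title></channel></rss>'
print(add_license(feed, "https://creativecommons.org/licenses/by/4.0/"))
```

The license chosen here (Attribution 4.0) is only an example; pick whichever Creative Commons license matches the reuse you actually want to allow.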

How Can I Track RSS Scraping?

Many people will find RSS scrapers by accident when they search for keywords relevant to their blog or site. However, you can keep track of your content using automated tools like Fairshare that are designed for tracking dynamic content.

In the end, though, it’s best to keep an eye on the search engines for the terms that others commonly use to find your site. Scrapers will often show up in those same results, though initially they will likely rank lower than your site.
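Once a search turns up a suspicious page, you may want a quick way to judge how much of your post it actually copied. The Python sketch below is one simple approach and is not how Fairshare or any search engine works: it compares overlapping runs of five words (often called shingles) between your text and the suspect text. The function names and the five-word window are illustrative choices.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word runs of the given size."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def copied_fraction(original, suspect, size=5):
    """Fraction of the original's shingles that also appear in the suspect."""
    orig = shingles(original, size)
    if not orig:
        return 0.0
    return len(orig & shingles(suspect, size)) / len(orig)


post = "the quick brown fox jumps over the lazy dog near the river bank"
scraped_page = "Spam header. " + post + " Buy now."
print(copied_fraction(post, scraped_page))
```

A value near 1.0 suggests the post was lifted wholesale, while a low value points to an excerpt, a quotation or unrelated text. You would still need to fetch the suspect page and strip its HTML yourself before comparing.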

What is the Future of RSS Scraping?

Though it’s difficult to predict what spam tactics will be popular in the coming years, RSS scraping has been a problem for at least six years and is continuing today.

That being said, it has fallen out of favor with many spammers, who prefer generating content or scraping only excerpts from feeds to avoid duplicate content penalties in the search engines. Many active spammers still use the method, though spammers have clearly diversified their tactics in this area.

Bottom Line

There’s no doubt that RSS scraping can be and often is very annoying and very problematic. That being said, there’s no reason that it should be a major headache or that it should become a reason to walk away from your site. Most cases of RSS scraping don’t have a major impact on a blog and those that do can usually be dealt with.

If you are having a serious problem with RSS scraping, please feel free to drop me a line or, if you think you may need outside help, to see if I can help via my consulting services.

All in all, RSS scraping is a reality most bloggers and webmasters will have to deal with, but it’s not one that should sink your site if you’re savvy about how to handle it.