Why RSS Scraping Isn’t O.K.

Jonathan Bailey, August 29, 2006

Technology blogger Robert Scoble struck something of a nerve when he decried RSS scraping in responding to a post that claimed such scraping is acceptable as long as the format is in RSS.

Many bloggers chimed in on both sides of the issue. Several pointed to an earlier post by Scoble where he claimed that, “By publishing RSS as full text you’re buying into a system where your words will be republished in a variety of ways.”

This caused many to wonder why Scoble “changed his mind” and caused others to question the ethical and legal difference between spam bloggers, aggregators and online news readers such as Bloglines.

These are tough questions and the conversation, which is ongoing, shows off many of the different views regarding RSS. However, the pro-scraping crowd has significant legal and ethical obstacles to overcome before such scraping ever becomes acceptable.

The Second “S”

RSS, currently, stands for “Really Simple Syndication”. The pro-scraping crowd loves to point out the second “S”, which seems to prove that RSS feeds were intended for syndication. However, there are several flaws with that rather simple theory.

First, the current acronym for RSS is a relatively recent invention. Until RSS 2.0, it stood for “Rich Site Summary” (RSS 0.91) or “RDF Site Summary” (RSS 1.0), perhaps more accurate terms. In fact, all RSS 0.91 and 1.0 feeds technically still use those names.

Second, even though it does say “syndication”, that does not mean open syndication. Just because a format is easy to syndicate does not imply an open license to do so, much like how an unlocked door does not promise the right to enter a room. The Web is filled with standards that make content portable and none of them carry a blank check for reuse.

Finally, the name of a file type is neither a binding contract nor an indication of implied license. An implied license, generally, only goes as far as what is required to use the content, not what the file name says. Caching Web pages is legal because of an implied license; RSS scraping is not.

Despite that, the implied license myth has traveled far and wide. Sadly for those who carry it, it is a myth that crumbles under any close examination.

The Implied License

An implied license is usually derived from what is required for a user to take advantage of content offered to them and what the consensus is in the industry. However, as the Scoble post proved, there is almost no consensus at all.

Furthermore, an implied license cannot take effect when the publisher isn’t even aware of its existence. Since Myspace, Xanga, Blogspot, Wordpress.com and other sites publish feeds automatically with very few users knowing what they are or what the potential uses and dangers are, there is no way any of them can enter into an implied license.

Simply put, unlike publishing a Web page, which is a direct act and carries some implied licenses, publishing an RSS feed is often automatic and, many times, done without the user’s knowledge. After all, only two percent of people use RSS feeds and less than eleven percent even know what one is. Compare that to the eighty-eight percent of people who know what “Spam” is.

In the end, public aggregation is not “a foreseeable inevitable consequence” of posting an RSS feed (meaning it is possible to use the feed without republishing it) and most people do not know of their own feed’s existence. Thus, even if there is an implied license with RSS feeds, it almost certainly does not extend as far as to allow resyndication for any purpose.

What About Fair Use?

The next favorite out for scrapers is to turn to fair use. Unfortunately for them, that argument falls equally flat.

Fair use is governed by a framework made up of four factors:

the purpose and character of the use, including whether such use is of commercial nature or is for nonprofit educational purposes;
the nature of the copyrighted work;
the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
the effect of the use upon the potential market for or value of the copyrighted work.

With that in mind, scraping is a classic example of what is not fair use. It takes the whole work, reproduces it, usually for commercial purposes and often without attribution, while offering no commentary, criticism or educational value. It also significantly damages the market for the work by creating a duplicate version of it.

It is highly unlikely that any copyright lawyer would attempt to bring forward a fair use defense to protect a scraper. Even if one did, it would almost certainly be shot down cold. While scrapers can help their case by taking only shorter snippets and carefully attributing their use, even that produces problems as the use still fails the “character of use” test.

Simply put, fair use is not a license to do what one will with another’s work, even if it isn’t for profit. Fair use is a pointed right that is targeted mostly at education, news and other public services. It has never been used successfully to protect the scraping of creative works.

Other Laws

Of course, copyright law is just one law that a scraper has to worry about. As I discussed earlier, scrapers can also be sued under a variety of other laws, including trespass to chattels (property), the Computer Fraud and Abuse Act and even breach of contract.

In fact, most cases against scrapers have hinged on those other torts. Trespass to chattels was what won the Farechase case (Note: Fair use was upheld since only data, not creative works, was being scraped). It also came into play in the eBay vs. Bidder’s Edge case.

In these cases, the judgment is not against the use of the content, but rather, the act of scraping itself. Even if scrapers are able to change their use so that it has a better defense against a copyright infringement suit, the act of scraping will stay the same and scrapers are on a very long losing streak in court.

That seems unlikely to change any time in the near future.

The Ethical Difference

With legality aside, the question becomes simple: what is the ethical difference between a splogger and a Web-based aggregator like Bloglines?

Most seem to agree that the difference between republication and commercial aggregation is a fluid one. Still, many would prefer to have some kind of moral distinction between the two kinds of sites.

Perhaps the difference is in usability and features. After all, Bloglines does much more than just scrape feeds and repost them; it lets users subscribe to, find and organize feeds. This helps bloggers and users alike. Sploggers and scrapers, on the other hand, just take the content and add nothing to it. Though they might mesh it with other feeds, that adds little for the user in most cases and greatly injures the original blogger, who loses readership to a more convenient, but ill-gotten, site.

Perhaps the difference is in the target. Sploggers target anyone who happens to visit the site, while Bloglines only targets users who are members of the site and have either actively subscribed to the feed or are searching for something in it. The goal isn’t to get the content blindly in front of as many eyeballs as possible, but rather to get it into the hands of the people already interested in it.

Perhaps the difference is in the direction of the traffic. If I post another’s content to a spam blog, visitors will likely subscribe to my feed and only visit my site, even if I provide attribution. With Bloglines, people are encouraged to subscribe to the original feed and that, in turn, encourages them to visit the original site. In short, Bloglines, like a search engine, is a middle man. A splogger, on the other hand, is a destination.

However, most likely the difference is a bit of all of those things combined with the intent of the site using the feed. A splogger or a scraper wants to make money off of search engine traffic and by drawing as many people to their site as they can. Bloglines wants to do so by offering a service to end users to help them subscribe, organize and find feeds. They get no direct benefit from your content, that I have been able to find, and their fortunes do not depend on how many people view your work.

The end goals, in relation to your content, are completely different. Their asset is not their content, but their service. While your content does appear on their site, it is not why people visit. People visit because it’s the best news aggregator for them.

Though it may seem small, that is a major difference.

Protect Yourself

Though the current legal and ethical climate makes it clear scrapers should not rely on any kind of implied license, neither should copyright holders. Attitudes and legal opinions can change at any time and there is no reason to leave any room for doubt.

I am a definite advocate of placing Creative Commons Licenses, or other copyright licenses, into an RSS feed. There are many tools to help you do that, including Feedburner.

Even if there is an implied license with RSS feeds, any explicit license would override it. This also prevents confusion and allows you to avoid stepping into ethical and legal traps by creating double standards. If you create a license, place it prominently in your feed and stick to it, you will be much better protected.
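For those who maintain their own feed rather than relying on a service, the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes an RSS 2.0 feed, uses the creativeCommons RSS module namespace to carry the license URL at the channel level, and the function name `add_license` and the sample feed are made up for this example. It is not how Feedburner or any particular blogging tool does it.

```python
import xml.etree.ElementTree as ET

# Namespace of the creativeCommons RSS module (an extension to RSS 2.0).
CC_NS = "http://backend.userland.com/creativeCommonsRssModule"


def add_license(rss_xml: str, license_url: str) -> str:
    """Return the feed with a channel-level license element set to license_url.

    Hypothetical helper for illustration; assumes an RSS 2.0 document.
    """
    # Keep a readable prefix in the serialized output.
    ET.register_namespace("creativeCommons", CC_NS)
    root = ET.fromstring(rss_xml)
    channel = root.find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 document: no <channel> element")

    tag = f"{{{CC_NS}}}license"
    element = channel.find(tag)
    if element is None:
        # No license yet, so add one; otherwise the existing one is updated.
        element = ET.SubElement(channel, tag)
    element.text = license_url
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Made-up sample feed and license choice, purely for demonstration.
    sample = (
        '<rss version="2.0"><channel>'
        "<title>Example Blog</title>"
        "<link>http://example.com/</link>"
        "<description>Sample feed</description>"
        "</channel></rss>"
    )
    print(add_license(sample, "http://creativecommons.org/licenses/by-nc-sa/2.5/"))
```

Putting the license on the channel covers the whole feed; the same element could instead be attached to individual items if different posts carry different terms.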

However, the most important step is to let everyone know that RSS is not a license to syndicate blindly and without permission. Rather, it is a tool to facilitate desired syndication, both by personal aggregators and by other sites (such as Blogburst).

It takes no time to protect yourself against the misunderstandings and uncertainties around RSS scraping. It’s best to do so. Otherwise, you might not be able to tell the bad guys from the good guys.

That alone creates a very confusing, and very dangerous, environment for everyone.