
Why RSS Scraping Isn’t O.K.

Technology blogger Robert Scoble struck something of a nerve when he decried RSS scraping in response to a post that claimed such scraping is acceptable as long as the content is in RSS format.

Many bloggers chimed in on both sides of the issue. Several pointed to an earlier post by Scoble where he claimed that, “By publishing RSS as full text you’re buying into a system where your words will be republished in a variety of ways.”

This caused many to wonder why Scoble had “changed his mind” and led others to question the ethical and legal difference between spam bloggers, aggregators and online news readers such as Bloglines.

These are tough questions and the conversation, which is ongoing, shows off many of the different views regarding RSS. However, the pro-scraping crowd has significant legal and ethical obstacles to overcome before such scraping ever becomes acceptable.

The Second “S”

RSS, currently, stands for “Really Simple Syndication”. The pro-scraping crowd loves to point out the second “S”, which seems to prove that RSS feeds were intended for syndication. However, there are several flaws with that rather simple theory.

First, the current expansion of the acronym is a relatively recent invention. Until RSS 2.0, it stood for “Rich Site Summary”, perhaps a more accurate term. In fact, all RSS 0.91 and 1.0 feeds technically still use that name.

Second, even though it does say “syndication”, that does not mean open syndication. Just because a format is easy to syndicate does not imply an open license to do so, much like how an unlocked door does not promise the right to enter a room. The Web is filled with standards that make content portable and none of them carry a blank check for reuse.

Finally, the name of a file type is neither a binding contract nor an indication of an implied license. An implied license, generally, only goes as far as what is required to use the content, not what the file name says. Caching Web pages is legal because of an implied license; RSS scraping is not.

Despite that, the implied license myth has traveled far and wide. Sadly for those who carry it, it is a myth that crumbles under any close examination.

The Implied License

An implied license is usually derived from what is required for a user to take advantage of content offered to them and what the consensus is in the industry. However, as the Scoble post proved, there is almost no consensus at all.

Furthermore, an implied license cannot take effect when the publisher isn’t even aware of its existence. Since Myspace, Xanga, Blogspot, WordPress.com and other sites publish feeds automatically, with very few users knowing what they are or what the potential uses and dangers are, there is no way any of those users can enter into an implied license.

Simply put, unlike publishing a Web page, which is a direct act and carries some implied licenses, publishing an RSS feed is often automatic and frequently happens without the user’s knowledge. After all, only two percent of people use RSS feeds and less than eleven percent even know what one is. Compare that to the eighty-eight percent of people who know what “Spam” is.

In the end, public aggregation is not “a foreseeable inevitable consequence” of posting an RSS feed (meaning it is possible to use the feed without republishing it) and most people do not know of their own feed’s existence. Thus, even if there is an implied license with RSS feeds, it almost certainly does not extend as far as to allow resyndication for any purpose.

What About Fair Use?

The next favorite out for scrapers is to turn to fair use. Unfortunately for them, that argument falls equally flat.

Fair use is governed by a framework made up of four factors:

  1. the purpose and character of the use, including whether such use is of commercial nature or is for nonprofit educational purposes;

  2. the nature of the copyrighted work;

  3. amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

  4. the effect of the use upon the potential market for or value of the copyrighted work.

With that in mind, scraping is a classic example of what is not fair use. It takes the whole work, reproduces it, usually for commercial purposes and often without attribution, while offering no commentary, criticism or educational value. It also significantly damages the market for the work by creating a duplicate version of it.

It is highly unlikely that any copyright lawyer would attempt to bring forward a fair use defense to protect a scraper. Even if one did, it would almost certainly be shot down cold. While scrapers can help their case by taking only short snippets and carefully attributing their use, even that produces problems, as the use still fails the “character of the use” test.

Simply put, fair use is not a license to do what one will with another’s work, even if it isn’t for profit. Fair use is a narrow right targeted mostly at education, news and other public services. It has never been used successfully to protect the scraping of creative works.

Other Laws

Of course, copyright law is just one law that a scraper has to worry about. As I discussed earlier, scrapers can also be sued under a variety of other laws including trespass to chattels (property), the Computer Fraud and Abuse Act and even breach of contract.

In fact, most cases against scrapers have hinged on those other torts. Trespass to chattels was what won the Farechase case (Note: Fair use was upheld since only data, not creative works, were being scraped). It also came into play in the Ebay vs. Bidder’s Edge (PDF) case.

In these cases, the judgment is not against the use of the content, but rather, the act of scraping itself. Even if scrapers are able to change their use so that it has a better defense against a copyright infringement suit, the act of scraping will stay the same and scrapers are on a very long losing streak in court.

That seems unlikely to change any time in the near future.

The Ethical Difference

Legality aside, the question becomes simple: what is the ethical difference between a splogger and a Web-based aggregator like Bloglines?

Most seem to agree that the difference between republication and commercial aggregation is a fluid one. Still, many would prefer to have some kind of moral distinction between the two kinds of sites.

Perhaps the difference is in usability and features. After all, Bloglines does much more than just scrape feeds and repost them; it lets users subscribe to, find and organize feeds. This helps bloggers and users alike. Sploggers and scrapers, on the other hand, just take the content and add nothing to it. Though they might mesh it with other feeds, that adds little for the user in most cases and greatly injures the original blogger, who loses readership to a more convenient, but ill-gotten, site.

Perhaps the difference is in the target. Sploggers target anyone who happens to visit the site; Bloglines targets only users who are members of the site and have either actively subscribed to the feed or are searching for something in it. The goal isn’t to get the content blindly in front of as many eyeballs as possible, but rather to get it into the hands of the people already interested in it.

Perhaps the difference is in the direction of the traffic. If I post another’s content to a spam blog, visitors will likely subscribe to my feed and only visit my site, even if I provide attribution. With Bloglines, people are encouraged to subscribe to the original feed and that, in turn, encourages them to visit the original site. In short, Bloglines, like a search engine, is a middle man. A splogger, on the other hand, is a destination.

However, most likely the difference is a bit of all of those things combined with the intent of the site using the feed. A splogger or a scraper wants to make money off of search engine traffic and by drawing as many people to their site as they can. Bloglines wants to do so by offering a service to end users to help them subscribe to, organize and find feeds. They get no direct benefit from your content that I have been able to find, and their fortunes do not depend on how many people view your work.

The end goals, in relation to your content, are completely different. Their asset is not their content, but their service. While your content does appear on their site, it is not why people visit. People visit because it’s the best news aggregator for them.

Though it may seem small, that is a major difference.

Protect Yourself

Though the current legal and ethical climate makes it clear scrapers should not rely on any kind of implied license, neither should copyright holders. Attitudes and legal opinions can change at any time and there is no reason to leave any room for doubt.

I am a definite advocate of placing Creative Commons Licenses, or other copyright licenses, into an RSS feed. There are many tools to help you do that, including Feedburner.
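As a sketch of what that looks like in practice, the Creative Commons RSS module lets a feed declare an explicit license at the channel level (and optionally per item). The feed titles, URLs and license choices below are illustrative placeholders, not a recommendation of any particular license:

```xml
<?xml version="1.0"?>
<rss version="2.0"
     xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <description>A blog that states its feed license explicitly.</description>
    <!-- Channel-wide license: applies to every item in the feed -->
    <creativeCommons:license>http://creativecommons.org/licenses/by-nc/2.5/</creativeCommons:license>
    <item>
      <title>Sample Post</title>
      <link>http://example.com/sample-post</link>
      <!-- An item-level license can also be declared for an individual post -->
      <creativeCommons:license>http://creativecommons.org/licenses/by/2.5/</creativeCommons:license>
    </item>
  </channel>
</rss>
```

Services such as Feedburner can add an element like this for you; the point is simply that an explicit license element in the feed itself leaves no room for an implied-license argument.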

Even if there is an implied license with RSS feeds, any explicit license would override it. This also prevents confusion and allows you to avoid stepping into ethical and legal traps by creating double standards. If you create a license, place it prominently in your feed and stick to it, you will be much better protected.

However, the most important step is to let everyone know that RSS is not a license to syndicate blindly and without permission. Rather, it is a tool to facilitate desired syndication, both by personal aggregators and by other sites (such as Blogburst).

It takes no time to protect yourself against the misunderstandings and uncertainties around RSS scraping. It’s best to do so. Otherwise, you might not be able to tell the bad guys from the good guys.

That alone creates a very confusing, and very dangerous, environment for everyone.


12 Responses to Why RSS Scraping Isn’t O.K.

  1. Tammy says:

    Thanks for this great analysis of the second “S” and fair use arguments. It’s concise and really helpful in clearing up my own understanding of some of these issues. I had the “it’s your own fault you syndicated your feed, not my fault I used it” argument used on me by a scraper last April. Now I can hand over a good web link to support my points!

  2. JB says:

    The victim-blaming element of it was one of the reasons that I decided to write this. I'm tired of scrapers blaming victims just because they provide RSS feeds.

    "It's not my fault that I copied all of your content and put it on my site, it's yours for making it available."

    Poppycock.

    I'm glad that it helped you out and gave you some ammunition to fire back with!

  3. Ja says:

    All the same, I feel that people who put out new “standards” should have a responsibility to put some safeguards in place. I’m not talking about RSS, though… I mean Microformats and the like.

    I was on the site of a 16-year-old kid who made dapps to search for hreviews and hcards, and on this WordPress site of his, he had more microformat markup than even Tantek uses. The short of it is, just by reading the metadata he put on his pages I know exactly where he lives, down to the street and house number. And people are just worried about MySpace, heh.

    People are very excited about microformats and the semantic web but don’t take the time to think about exactly what they’re exposing themselves to, and the developers either don’t care or haven’t really thought it through themselves.

    I do have faith this will settle itself one way or another. One thing that could be looked at is security schemes, but that would limit the exposure people get, which they’ll have none of, I’m sure.

    I’m not defending sploggers or plagiarists the least bit, but I am suggesting you don’t leave your front door wide open in a bad neighborhood.

  4. [...] What is worse than the debate itself is the tendency among some in it to blame the victims of RSS scraping because they chose to use full feeds. Whether it is a misguided notion that everything in an RSS feed is fair game or that the victims were “asking for it” is not important. [...]

  5. [...] guest Jason Calacanis, blog entrepreneur and former head of Weblogs Inc., the panel reaches roughly the same conclusion that I did back in August, that there is no implied license to scrape RSS [...]

  6. [...] I did a cursory search of the web and found an interesting article by Jonathan Bailey which suggests that rss feeds are, by definition, public domain protected, syndicated property and should be respected as such. (thanks for the correction). [...]

  7. [...] talked several times on this site about why RSS scraping is not acceptable. Yet, many in the pro-syndication camp continue to talk of implied licenses or variations of the [...]

  8. [...] just in case you’re not convinced that this sort of thing is wrong, here’s what Jonathan Baily from Plagiarism Today has to say about it: … scraping is a classic example of what is not fair use. It takes the [...]

  9. [...] to the scraping practice that Global Grind was utilizing. I included examples as well as an article that explained why RSS scraping was not OK. Even though they appear to be simply scraping the page itself, not our RSS feed, some of the same [...]

  10. [...] In 2006 I wrote an article entitled “Why RSS Scraping Isn’t OK” that laid out many of the arguments provided in favor of RSS scraping and republishing and [...]

  11. [...] Later: Why RSS Scraping Still is Not OK // Five years ago I penned an article entitled “Why RSS Scraping Isn’t OK“. The goal of the article was to take a look at the arguments scrapers used, legal and [...]

  12. [...] The climate was just right at the time. Ad networks seemed willing to turn a blind eye to such “aggregation”, Google had not yet made a push to prevent content farming and there was a lot of legal uncertainty and confusion surrounding RSS scraping. [...]
