Five years ago I penned an article entitled “Why RSS Scraping Isn’t OK”. The goal of that article was to examine the legal and ethical arguments scrapers used and explain why the realities of the law were not on their side.
Basically, at that time, RSS scrapers were arguing that, by putting content into an RSS feed, one was giving permission to use it on other sites, essentially creating an implied license to republish it. However, as I explained in that article, the legal realities are quite different: RSS scraping without permission is, almost certainly, copyright infringement.
However, while the legal realities haven’t changed much in the past five years, the people doing the scraping have. Spammers and sploggers, now wary of duplicate content issues, have largely abandoned RSS scraping in favor of other techniques. Today, the scrapers are fewer but position themselves as editors, curators and collectors, people building moderated lists of great content.
This shift hasn’t done much to alter the legal realities of scraping nor has it done much to placate creators who still see this as one of the most common issues they face.
The truth is that, even with this new veneer, RSS scraping is still not legally or ethically acceptable. Whether it’s curators or spammers, those who scrape from RSS feeds are in a dubious position and one that seems to be getting worse every day.
The Past Five Years Of Law and Scraping
The past five years of legal history have been strangely quiet on the issue of RSS scraping. Despite how common the behavior is, very few suits have dealt with the issue.
The best known of those cases was Gatehouse Media vs. The New York Times, which saw Gatehouse Media, the owner of the “Wicked Local” brand sites as well as hundreds of smaller papers, sue The New York Times for scraping its RSS feeds for inclusion in Boston.com’s “Your Town” section.
The suit centered only on the headlines and excerpts from the stories involved, but the Times apparently felt its position was weak enough to warrant settling the matter quickly and publicly. In the end, The New York Times agreed to stop scraping Gatehouse feeds and to respect the restrictions Gatehouse Media placed on its content.
Related cases on the issue of data scraping, sometimes called data mining, have largely been equally negative for the scrapers. Though only at the summary judgment phase at last report, Snap-on Business Solutions Inc. v. O’Neil & Assocs., Inc. highlights the other legal perils of scraping.
In that case, Snap-on produced and maintained a database of auto parts for Mitsubishi. After two years, Mitsubishi began to look at other vendors for the contract but Snap-on would not give up control over the data. Mitsubishi eventually hired an outside contractor, O’Neil, to scrape the content out of the database and bring it into a new system. When Snap-on learned of the scraping, they filed suit.
In the summary judgment phase of the case, the judge ruled that Snap-on likely had viable claims under the Computer Fraud and Abuse Act (CFAA), trespass to chattels and breach of contract. The court rejected the copyright infringement argument, but only because the content copied did not qualify for copyright protection, unlike the content of most RSS feeds.
The case shows, as I pointed out years ago, that scraping is a legal minefield. Even cases that seem to go the way of the scraper, such as the Cvent, Inc. v. Eventbrite, Inc. case, are highly fact-specific and seem to hinge more on poor case preparation than the law itself. (Note: Even in that “victory” the copyright claims and the unjust enrichment claims survived dismissal.)
Instead, most seem to follow the route of the Ticketmaster L.L.C. v. RMG Technologies, Inc. case, a 2007 win for Ticketmaster against a ticket-sniping service that was snatching up popular tickets through an automated process. In that case, the court ruled that RMG was infringing copyright by merely browsing the relevant pages, since it was doing so in violation of Ticketmaster’s “browsewrap” license.
In short, the legal realities for scrapers are even bleaker than they were five years ago. The implied license argument that’s so popular among scrapers has been eroded and, all in all, it’s almost impossible to scrape legally, RSS or otherwise.
Yet, what’s changed in the last five years isn’t so much the law, but the scrapers themselves and that’s where things have truly gotten interesting.
The Death of the Spammer Scraper
Back in 2006, your “typical” RSS scraper was probably a spammer: someone seeking a quick, hands-off way of filling a large number of sites with search-engine-friendly content in order to rise in the rankings and, eventually, outrank the original work for certain keywords.
Those days, however, are gone. Though scraping spammers still exist, most spammers moved on from this method as Google and the other search engines improved their duplicate content detection, making it a less effective technique. Methods such as content spinning, content generation and even cheap outsourcing have proved to be more effective and equally reliable.
This decline has largely mirrored the overall decline in traditional (reader-based) RSS usage.
At the same time, tools for integrating RSS into existing websites have grown much more common and easier to use over the past five years. Though some were developed for spamming, other tools were meant to let authors integrate all of their own sites in one place. However, some have latched onto these tools as a way of bringing in the work of others without permission.
This has created a situation where the people doing the scraping are fewer in number but likely much more dangerous. Where search engines were relatively effective at filtering out spammers, these sites tend to appear much more legitimate, increasing the likelihood they could be mistaken for the originals.
Fortunately, the law doesn’t make a great distinction between spammers and those who scrape with less nefarious intentions, but many who engage in this practice have, according to emails I’ve seen, claimed an ethical or even legal right to engage in the scraping, calling themselves “editors”.
This has set the stage for some ugly battles that, while they haven’t reached the courtroom yet, have certainly been heated on the Web.
Indeed, this argument seems to be one that’s moving out of the courtroom and into the court of public opinion, a place where it’s likely to stay given how straightforward the legal issues seem.
In the end, consider that if the New York Times Company, one of the most powerful media institutions on the planet, couldn’t or didn’t want to defend the scraping of mere headlines and summaries, there’s little hope for a successful defense of full RSS scraping. This is especially true in light of the other, related scraping cases.
However, those who want to scrape and those who are willing to allow their feeds to be scraped do have options. Creative Commons, for example, has modules for RSS feeds that enable applications to detect what they are allowed to do with a feed.
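As a sketch of how that works: the creativeCommons RSS module lets a feed declare its license in the feed XML itself, at the channel or item level. The feed title, links and license URL below are placeholders of my own, not from any particular site:

```xml
<!-- Sketch of an RSS 2.0 feed declaring a Creative Commons license
     via the creativeCommons module; titles and URLs are placeholders. -->
<rss version="2.0"
     xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule">
  <channel>
    <title>Example Blog</title>
    <link>https://example.com/</link>
    <description>Posts from an example blog.</description>
    <!-- Applies to the whole feed; the element can also appear inside an <item> -->
    <creativeCommons:license>https://creativecommons.org/licenses/by-nc/4.0/</creativeCommons:license>
  </channel>
</rss>
```

An application reading the feed can then check for the `creativeCommons:license` element before deciding whether, and how, it may reuse the content.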
To those who don’t wish to allow it, I encourage you to include in the feed itself a notice stating that you do not allow republishing and that the feed is for private, personal use only. Though it shouldn’t be necessary under the law, it’s a wise move that blocks many of the potential arguments a scraper might raise. Furthermore, such footers can greatly help with detecting scrapers.
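To illustrate one way such a notice might be added automatically, here is a hypothetical Python sketch that appends a rights statement to every item description in an RSS 2.0 document. The notice wording and the sample feed are my own illustration, not from the original article:

```python
import xml.etree.ElementTree as ET

# Illustrative notice text -- adjust the wording to your own terms.
NOTICE = ("This feed is for private, personal use only. "
          "Republication without permission is not allowed.")

def add_notice(rss_xml: str) -> str:
    """Append the notice to each item's <description> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        desc = item.find("description")
        if desc is not None:
            desc.text = (desc.text or "").rstrip() + " " + NOTICE
    return ET.tostring(root, encoding="unicode")

feed = (
    '<rss version="2.0"><channel><title>Example</title>'
    '<item><title>Post</title><description>Hello.</description></item>'
    '</channel></rss>'
)
print(add_notice(feed))
```

A side benefit of a distinctive notice string is detection: searching the web for the exact phrase can surface scraped copies of the feed.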
All in all, RSS scraping has definitely changed in terms of who is using it and why, but the threat isn’t all that different and the legal realities have hardly changed at all.
This means that RSS scraping can still be fought effectively; it’s just that the people you’re moving against may be a bit more vocal in their views.