Web scraping, the use of automated tools to grab content from websites for use elsewhere, is a topic we haven’t looked at in quite some time.
In the early-to-mid 2000s, it was a popular tool for spammers to quickly fill websites and blogs with endless reams of copied content. It was often paired with article spinning as a way to generate “new” articles based on the original content in hopes of further fooling the search engines.
At the time, the trend was largely driven by the popularity of RSS. Sites would provide RSS feeds, either partial or full, so their users could subscribe to their sites via services like Google Reader. However, RSS also made it easy to access the raw content of the site, making it extremely easy to scrape and republish.
Unfortunately for the scrapers, times changed. In 2011, Google made a series of shifts that demoted low-quality sites, including scraper sites. This made scraping (along with article marketing and other tactics) untenable as a search engine strategy. To make matters worse, RSS usage was also falling out of favor, with even Google Reader closing its doors in July 2013.
When, in 2013, an Israeli court ruled that RSS scraping was legal (at least under certain conditions), the decision was largely irrelevant. Spammers and creators alike had largely moved on.
But that doesn’t mean that web scraping stopped. Scrapers got more advanced and began scraping directly from web pages, one even going as far as to scrape me. Tools for protecting websites against scraping became popular and the battle shifted from one of protecting copyrighted material and thwarting spammers to one of protecting valuable data.
To that end, we have an update on that battle: the Ninth Circuit has ruled that web scraping is not a violation of the Computer Fraud and Abuse Act (CFAA), poking a significant hole in one of the key legal theories that caused scraping to be dubbed a “Legal Minefield” in 2006.
So, what does this ruling mean and will it lead to a new rise in web scraping? Let’s take a closer look to find out.
HiQ vs. LinkedIn: A Battle Over Scraping
HiQ Labs is a tech company that aims to help employers better understand their workforce by scraping publicly available information and turning it into useful data. One of the main sites it pulls from is Microsoft-owned LinkedIn, which took umbrage at hiQ’s use of its data.
Back in 2017, LinkedIn sent hiQ a cease and desist letter demanding that the company stop scraping data from LinkedIn and its user profiles. One of the main arguments was that hiQ was violating the CFAA as the scraping was against their terms of service.
The CFAA was originally passed in 1986 as a means of outlawing unauthorized access to computers that contained sensitive information. The act has been amended seven times since, with a key amendment coming in 1996, when it was expanded to cover far more computers.
The core of the CFAA is that it makes it illegal to access a computer without authorization or in excess of authorization. One of the key legal theories was that scraping went against most sites’ terms of service, meaning that such behavior could be a violation of the CFAA since it was unauthorized access.
However, the Ninth Circuit, in upholding a lower court decision, looked at the issue a bit differently. According to the ruling, any member of the public had access to the public information on LinkedIn. LinkedIn argued that it could revoke that permission with a cease and desist letter, as it tried to do in 2017, but the court ruled that ignoring such a letter is not the same thing as hacking into a private system.
To make matters even worse for LinkedIn, the court barred it from taking any steps to block or stop hiQ’s scraping while the case is ongoing. HiQ claimed that such blocking would interfere with contracts it has with its customers and that it might be out of business before the case concludes if LinkedIn took such action. However, LinkedIn may be able to take such steps after the case is over.
While the case might seem to be a major legal disaster for fighting scraping, there are some key limitations to keep in mind.
Limitations of the Ruling
The ruling ultimately deals with a fairly specific set of circumstances, facts that may or may not apply to the scraping others are dealing with.
- Public Data Only: All of the data hiQ scraped was public data, available to anyone on the web regardless of whether they have a LinkedIn account. Nothing in the data was behind any kind of walled garden, pay or otherwise.
- Non-Copyright-Protected Content: The data involved appears to mostly, if not exclusively, be facts and information not protectable under copyright. However, even if there is content protected by copyright, it would be owned by the LinkedIn users, not LinkedIn itself. Remember in 2012 when Craigslist tried to get users to assign copyright to it on posts? This was part of why.
- Previously Allowed: HiQ also argued that, for years, LinkedIn had tacitly accepted its scraping, even attending conferences where the company openly acknowledged that its services were based on LinkedIn data.
Obviously, this case doesn’t deal with the kinds of scraping that were so common 10-20 years ago. Taking content from a website and republishing it, even with “spinning”, is still likely a copyright infringement and there are plenty of legal tools to stop it without the CFAA.
This case deals more with the battle over data. While here it’s user data from LinkedIn, it could just as easily have been pricing data from shopping sites, viewing metrics from a video site or anything of the like.
Companies find this data very important and part of that value is making it available to the public. However, while they are happy to share pieces of it at a time, they are far more protective of the aggregate, and that is exactly what scrapers can do: take the pieces and automatically assemble the aggregate.
However, this case hints that companies may not be able to rely on the CFAA and their terms of service for protection. Instead, the battle is likely to be one over technical measures rather than legal protections.
Still, this case is far from over and, so far, only one circuit has weighed in. Cases like this have been historically very rare but are likely to get more common as companies seek to exploit this ruling or seek to counter it.
Whether or not this is an issue heading for the Supreme Court remains to be seen, but it’s definitely a case and an issue to watch.
If you’re a webmaster or content creator worried that this might usher in a new era of web scraping and spamming, I wouldn’t be concerned. The case doesn’t really deal with any copyright issues and, besides, pressure from Google and changes in technology are going to keep that behavior on the fringes, at least for now.
This is more a battle over data and who has the right to access it. Specifically, who has the right to access publicly available data and what they can do with it.
This is still an important battle, but one in which most creatives are ultimately just caught in the crossfire. Though scraping really isn’t much of a danger to the average webmaster in 2019, it’s still going to play a major role in the future of the internet.
The scraping battles of the 2000s may be long gone, but the ones of the 2010s and 2020s could wind up being the more important ones long term.