Update: See this article for important updates about ScrapeGoat.
There are many people who will gladly sell you scraping software. At last count, there were at least a dozen scraping packages out there, ranging from basic RSS feed scraping/autoblogging applications to full-fledged site scrapers that can strip content out of any templated site.
Webmasters, especially of larger sites, have grown accustomed to these scraping applications and have taken actions against them, including cloaking content, banning IPs, truncating RSS feeds and making their templates harder to scrape. Scrapers, on the other hand, have refined their software to stay effective, creating a game of cat and mouse that has become all too familiar on the Web.
At least one company, however, has gone in a different direction with selling scraping technology. What they offer isn't so much an application as a service that creates custom scraping programs. Their site, ScrapeGoat, is a chilling reminder of how far some people will go (and how much they will pay) in order to get your content.
What ScrapeGoat Does
ScrapeGoat describes itself as a "data extraction company" that custom builds data capturing, data harvesting and scraping solutions. In short, they create, support and maintain custom applications that extract data from one site and put it onto another, in a format that the purchaser can use. This can take many different formats, but almost all of them raise serious legal and ethical questions.
Worse still, ScrapeGoat actively acknowledges that many sites do not want their content scraped. However, rather than heeding Web sites' wishes, they offer services to circumvent blocks put up by content owners and to keep their customers anonymous. These services include anonymous proxies that can make it difficult, if not impossible, for a site to block the scraper or report them to their host.
However, ScrapeGoat's services do not come cheaply. While one has to request an estimate to get most of their pricing, they do indicate in their services list that you can "Purchase blocks of maintence hours for as little as $75/hr." If those rates are in any way comparable to the rates charged for new projects, it's easy to see how a program that might take anywhere from a few hours to a few months to build will likely cost hundreds, if not thousands, of dollars, making it many times more expensive than standalone scraper applications.
The cost, however, has not been a deterrent to ScrapeGoat's customers. At least a few have put down the cash to have ScrapeGoat create their custom scraping tool.
But even those with the money to spend on custom scraping software should probably be worried about the potential legal questions that it raises. Scraping software has been tested many times in court and the outcome has almost never been favorable to the scraper.
The Legality of ScrapeGoat
On at least two pages of its site, ScrapeGoat says the following: "Data scraping from public websites is very common and in most cases is 100% legal."
The statement, from my reading, is misleading at best and completely untrue at worst.
While it is true that scraping is very common on the Web and that most of it is considered legal, or at least tolerated, this is not because scraping itself is a legally sound practice. As one lawyer put it, the reasons are "economic, not legal". Many sites allow scrapers to carry out operations so long as A) the end user is directed to their site to make the sale and B) it doesn't place an undue burden on their servers.
In short, many sites, especially businesses, have a favorable viewpoint of scraping so long as it is done with permission and is cooperative in nature.
Unfortunately for scrapers, though, the scraping cases that have gone to court have decidedly favored the content creators.
One recent case involved American Airlines suing Farechase Inc. Farechase was, and still is, a site that searches several airlines in an attempt to find the best fare. In that regard, it works much like Expedia and similar sites. However, Farechase was scraping fare and flight information from American Airlines without permission and was doing so with every search of the service. This, according to American Airlines, produced tens of thousands of searches on their site every day and, after technological steps failed to stop Farechase, American Airlines decided to sue.
Though no damages were won in this particular case, American Airlines did receive an injunction against Farechase, forbidding them from scraping the site.
As it turns out, there are several potential legal problems that come from unauthorized scraping of a site.
- Copyright: Copyright is the most obvious potential problem. If the work scraped is copyrightable and it is reposted on another site or service, there is a very good chance that the scraping is a violation of copyright law. RSS scraping, for example, is traditionally handled by applying existing copyright law.
- Trespass to Chattels (i.e., Property): Trespass to chattels is a tort by which an individual "interfered with another person's lawful possession of a chattel". One can be found liable for such trespass if their use of the chattel dispossessed its owner of it, deprived the owner of its use, impaired its usefulness or brought harm to the person of the owner. This law has been used regularly to fight both spammers and scrapers since both impair the use of the target server. This was the linchpin on which the American Airlines case turned and most courts have found that unauthorized scraping is a trespass (Note: Some have even hinted that it might be a criminal matter as well as a civil one).
- Computer Fraud & Abuse Act (CFAA): The CFAA is a 1986 law (amended several times since) that is designed to reduce hacking, especially of government systems, by both defining offenses and setting forth strict punishments. Some courts have hinted that scraping, at least in some cases, might be a violation of the CFAA, especially scrapers that pull from financial institutions and/or sites dealing with foreign communication, which most sites do, at least on some level. The CFAA is a criminal law, not a civil one (though it does list some civil penalties), and it carries sentences of up to ten years upon conviction. Though it is unclear how the CFAA would apply, one theory is that it could be used to provide validation for "browsewrap" licenses (the terms of service at the bottom of various sites) that have previously had no enforceability due to the lack of an electronic signature.
Clearly, there are a lot of potential legal problems regarding scraping. Even the three example applications listed on ScrapeGoat's site, none of which work at this time, raise potential copyright concerns, trespass concerns, or both.
ScrapeGoat addresses these difficult legal questions by doing the following:
- Promising to stop scraping from any site that writes them to ask them to do so (Though how one would know who to complain to when the scraper uses an anonymous proxy is unclear).
- Stating in their FAQ that they "assume that the scraper you ask us to build will be used legally and ethically, and that you have obtained permission to use it on the targeted data source (when necessary). We reserve the right to refuse service to anyone wishing to use our scrapers in an illegal manner."
- Stating in their terms that they "will never knowingly build or host any scraper that is obviously illegal."
Clearly, ScrapeGoat realizes the extreme potential for abuse of this service and has worked to distance itself from the dangers. However, as the MGM v. Grokster case has shown, vendors can be held accountable for infringements that their users perpetrate. While their language appears to be an attempt to pass the "Induce Test" the Grokster case set up, it is their actions that will carry the most weight, not their words.
In that regard, the jury is still out.
A Bit of Hypocrisy
One strange note to the whole case is buried inside ScrapeGoat's terms and conditions. There, you will find this quote (emphasis added):
You agree that you will not use any robot, spider, other automatic device, or manual process to monitor or copy our web pages or the content contained herein without our prior expressed written permission. You agree that you will not use any device, software or routine to bypass, interfere or attempt to interfere with the proper working of the ScrapeGoat.com site or any activities conducted on our site. You agree that you will not take any action that imposes an unreasonable or disproportionately large load on our infrastructure.
In short, the ScrapeGoat terms and conditions forbid you from scraping their own site.
If scraping is so "legal" and "ethical" and doesn't damage the targeted sites, ask yourself: why does a scraping service provider prohibit you from scraping its own site?
There is little doubt that ScrapeGoat is a potentially dangerous service for content creators. The question is not what harm they can do, but rather whether ScrapeGoat is a sign of a new trend in scraping technology.
It's very likely that, as RSS scrapers become less effective and Web sites become more complex, we will see an end to "one size fits all" scrapers and move into a world where custom services, either programmed by the scraper or contracted out, are much more popular. While this would certainly make it harder for scrapers to get started and move the target of scraping to much larger sites (most blogs simply wouldn't be worth the effort), it would give a new lethality to the scraping industry.
Most likely though, this is just an offshoot of the existing scraping industry, which seems to be losing favor among black hat SEOs. It's a designer solution targeted at a very specific audience.
However, it's an audience that Web developers need to worry about. After all, its members are likely to be the most hardcore scrapers.
(Note: An email for comment sent to ScrapeGoat has gone unanswered after several days despite an on-site promise to respond within 24 hours. At this time, I have no comment from ScrapeGoat and would be thrilled if they could prove my theories wrong.)