5 Ways Web Scraping Has Already Affected You
Scraping is one of the most controversial activities on the Web. Entire companies are dedicated to engaging in scraping (legitimate and otherwise), and others are dedicated to fighting or preventing it.
While some scrapers, such as search engine spiders and other indexing services, are generally welcome, others, including spammers and competitors, generally are not.
This war between scrapers and anti-scrapers has largely taken place behind the scenes. Few casual Web users are even aware of what scraping is, let alone that it’s going on. However, as a recent article on GigaOm highlighted (thank you @franky for the link), scraping is going to have a growing impact on the Web, whether you operate a site or not.
Still, scraping has already been affecting the Web and the lives of everyday users in several key ways. We don’t have to look years down the road to the impending era of “Big Data”, as much of that future is already here.
Here are just five ways that scraping, and the way sites have reacted to it, has been affecting everyday users of the Web.
1. Impossible CAPTCHAs
CAPTCHA, which stands for “Completely Automated Public Turing test to tell Computers and Humans Apart”, is a test designed to prevent robots (i.e., scrapers) from accessing a portion of a site. CAPTCHAs typically take the form of a warped word or phrase that the human has to retype.
CAPTCHAs are seen as a part of life on the Web but they’ve been getting noticeably harder and harder. Some of it is due to poor design, but a lot of it is due to the fact that scrapers have been getting better and better at solving CAPTCHAs automatically.
Basically, every time a new CAPTCHA is released or an existing one is updated, there is a rush to write programs that can break it. This war has led not only to more and more creative CAPTCHAs but also more and more difficult ones, for humans and machines alike.
Next time you see a CAPTCHA you can’t solve easily, thank scrapers for it as they, in large part, are responsible.
2. More Sites Requiring Registration
If it seems like every site you visit wants you to register before you can do anything, it isn’t just because they want your information and a way to stay in contact with you (though that is part of it).
By forcing you to register, sites make it easier to control their data and can easily shut down abusive accounts. They can also get you to agree to terms of service (TOS) that, most likely, forbid any kind of automated data collection, making scraping a violation of the terms.
Finally, anything that you have to log in to see is hidden from scrapers without accounts, meaning that any spider that reaches the site will only get the handful of public pages, regardless of whether it obeys robots.txt.
In short, account registration keeps scrapers out with both legal and technological maneuvering.
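To make the robots.txt point concrete, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt rules, the "ExampleBot" user agent and the example.com URLs are all invented for illustration. It shows what a rule-abiding spider would decide for itself; a scraper that ignores robots.txt never runs a check like this, which is why a login wall is the stronger barrier.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, invented for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
"""


def polite_can_fetch(url, user_agent="ExampleBot"):
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# A public page is allowed; the members area is off limits,
# but only to crawlers that choose to obey the rules.
public_ok = polite_can_fetch("https://example.com/about")
members_ok = polite_can_fetch("https://example.com/members/profile")
print(public_ok, members_ok)
```

Nothing in robots.txt is enforced by the server. Requiring a login, by contrast, means the members page is simply never served to a visitor without an account.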
3. Add to Cart to See Price
If you’ve ever been shopping online and a store has said “Add to Cart to See Price”, it’s not because the site has a secret sale, but because they are trying to keep the price safe from bots that scrape product information from store sites, often for the purpose of price comparison.
Shopping sites have often been wary of these aggregators and comparison sites, preferring that their own site, rather than a third party, be the place users go to find a product. As a result, many have worked to frustrate these scrapers by making prices available only after an item is added to the cart, which is something a scraper typically can’t or won’t do.
The idea is that this not only rewards customers who are shopping the site organically with a lower price, but also makes the comparison sites less useful. However, most consumers seem to find this extremely annoying, especially when the deal is not that exceptional.
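As a rough sketch of why this frustrates a simple scraper, consider the toy example below. The page template, the render_product_page and scrape_price functions and the product itself are all hypothetical, and real stores implement this in many different ways. The point is only that a scraper that reads the page without adding the item to a cart finds no price to collect.

```python
import re

# Matches a dollar amount such as $14.99 in the page text.
PRICE_PATTERN = re.compile(r"\$\d+\.\d{2}")


def render_product_page(name, price_cents, in_cart):
    """Hypothetical store page: the price appears only once the item is in the cart."""
    if in_cart:
        price_text = f"${price_cents // 100}.{price_cents % 100:02d}"
    else:
        price_text = "Add to Cart to See Price"
    return f"<h1>{name}</h1><p class='price'>{price_text}</p>"


def scrape_price(html):
    """Naive scraper: return the first dollar amount on the page, or None."""
    match = PRICE_PATTERN.search(html)
    return match.group(0) if match else None


# The scraper sees the page as an anonymous visitor with an empty cart.
anonymous_view = scrape_price(render_product_page("Widget", 1499, in_cart=False))
# A shopper who has added the item to the cart sees the real price.
shopper_view = scrape_price(render_product_page("Widget", 1499, in_cart=True))
print(anonymous_view, shopper_view)
```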
4. More Hidden Fees
Speaking of shopping sites, another way that they have dealt with price comparison scrapers is to make greater use of hidden fees. The idea is to make the actual price of the product or service as low as possible so it will be scraped and used in comparisons, but then tack on additional fees later to compensate. Since scrapers typically can’t see or account for those fees, it makes the product appear cheaper than it is.
Once again, this is a sneaky trick that angers a lot of consumers, but it is a trend born out of heavy price comparison, which, in turn, depends heavily on content scraping.
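A small worked example shows how the arithmetic plays out. The stores, prices and fees below are made up purely for illustration, and prices are kept in cents to avoid rounding surprises. A comparison based only on the listed price, which is all a scraper typically sees, picks a different winner than a comparison based on what the shopper actually pays.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    store: str
    listed_cents: int  # price shown on the product page (what a scraper sees)
    fees_cents: int    # fees added later, at checkout

    @property
    def total_cents(self):
        return self.listed_cents + self.fees_cents


# Made-up offers for the same product.
offers = [
    Offer("Store A", listed_cents=1999, fees_cents=0),
    Offer("Store B", listed_cents=1499, fees_cents=750),
]

# What a price comparison site built on scraped data would report.
cheapest_listed = min(offers, key=lambda o: o.listed_cents)
# What the shopper actually ends up paying.
cheapest_total = min(offers, key=lambda o: o.total_cents)

print(cheapest_listed.store, cheapest_total.store)
```

Here Store B looks 5.00 cheaper in the comparison but costs 2.50 more at checkout.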
5. Your Data in Multiple Places
However, perhaps the most worrisome way that scraping has impacted casual Web users is by ensuring that their data is never just in one place.
While you might not run a blog or a site, you may have written an Amazon review, posted in a forum, started a Twitter account, created a Facebook profile, etc. Anything you’ve contributed online probably exists in multiple locations, with little hope of removing every copy.
There have been many Facebook scrapers, Twitter makes its Firehose available for others to pull from, and Amazon is one of the biggest targets for data mining.
In short, scraping virtually ensures you have no practical control over your data online. Once it’s public, there’s no telling where it is.
Bottom Line
With all of the talk about the next phase of the Web centering around “Big Data”, scraping is one of the more powerful and controversial tools in that era. It’s a way to obtain large amounts of data, often freeing it from a source where it cannot be easily removed. Whether that’s ethical or not depends on how the content is scraped and for what purposes.
However, don’t expect the war between scrapers and anti-scrapers to end any time soon. It’s only going to heat up as data, at least in some predictions, becomes more consolidated and valuable. Internet users, unfortunately, will be caught in the middle.
While what this means in the long run is yet to be determined, if the above examples are any indication, it’s going to mean a lot of headaches for Web users in the months and years ahead.