Just How Easy is it to Scrape Content?

Jonathan Bailey, June 14, 2012


We’ve talked a lot on Plagiarism Today about RSS scraping, meaning how spammers and other infringers can use your RSS feed as an easy way to access and republish your site’s content without permission.

However, scraping doesn’t just happen in the RSS feed. Any content that is visible on the Web can be scraped with a little bit of effort. The only difference is that, since RSS is standardized, there are many off-the-shelf applications that can scrape and republish content without much setup.

But just how hard is it to set up and run a non-RSS scraper? What if, instead of content from a blog’s feed, you wanted the prices from a competitor’s site or all of the data stored in another site’s records? The team from Distil, a company that provides anti-scraping protection for websites (previous coverage), sought to find out.

They set out to learn how much effort and expense it would take to have someone scrape records from a series of databases and produce a list of individuals who met a set of criteria. To do this, they chose two free and open-to-the-public datasets that were searchable but not easily grabbed as a whole. They then put out a bid on a freelancer site for the project and, after one day, $48 and 93,000 captured records, they had the list they wanted.

When all was said and done, it took almost no time and very limited technical expertise on their part to get the records they wanted. The same trick can be used on just about any site, whether you want to grab all of the static content on a blog, the images in a photo gallery or anything else available on the Web but not in an RSS feed.
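
To give a rough sense of how little code that can involve, here is a minimal sketch in Python that pulls paragraph text and image URLs out of a page using only the standard library. It is not the method Distil's freelancer used, which the post does not describe. The `PageScraper` class, the `scrape` function and the sample page are all made up for illustration, and a real scraper would first download each page rather than read it from a string.

```python
from html.parser import HTMLParser


class PageScraper(HTMLParser):
    """Hypothetical helper: collects image URLs and paragraph text from one page."""

    def __init__(self):
        super().__init__()
        self.images = []
        self.paragraphs = []
        self._in_paragraph = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # Record the image address if the tag has one.
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)
        elif tag == "p":
            # Start collecting the text of a new paragraph.
            self._in_paragraph = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False


def scrape(html):
    """Return (image URLs, paragraph texts) found in an HTML string."""
    parser = PageScraper()
    parser.feed(html)
    parser.close()
    return parser.images, parser.paragraphs


# Made-up sample page standing in for content fetched from a real site.
SAMPLE_PAGE = """
<html><body>
<p>First post on my <b>blog</b>.</p>
<img src="/gallery/photo1.jpg" alt="A photo">
<p>Second post.</p>
</body></html>
"""

images, paragraphs = scrape(SAMPLE_PAGE)
print(images)
print(paragraphs)
```

Point the same function at every page on a site and you have the core of a custom scraper. None of it depends on the site offering an RSS feed.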

On the flip side, if Distil was able to buy access to the records for $48, others out there can do it far more cheaply. It shows just how easy it is for those who are tech savvy to build and run custom scrapers, grabbing everything they want from a site.

If anything, I have a feeling Distil might have overpaid for the service, which likely only took a few minutes to set up and run.

All in all, it’s a cautionary tale for any webmaster: just because your content isn’t in an RSS feed doesn’t mean that it’s safe. If it’s valuable enough for someone to want it and it is publicly visible, they can get it.

Watch the full video below.