Dapper: The Scraper for the Common Man

Sometimes, especially with Web 2.0 companies, jargon can get a little bit out of hand. When someone says that a service allows you to "build an API for any website", it can be a bit difficult to understand what that really means.

However, put simply, Dapper is a scraper. Nothing more. It allows you to scrape content from a Web page and convert it into an XML document that can be easily used at another location. Though you won't find the words "scrape" or "scraper" anywhere on its site, that is exactly what it does.

What separates Dapper from other scrapers, both legitimate and illegitimate, is that it is both free and easy to use. In short, it makes the process of setting up the scraper simple enough for your everyday Internet user. While one has never needed to be a geek to scrape RSS feeds, now the technologically impaired can scrape content from any site, even those that don't publish RSS feeds.

Though the TechCrunch profile of the service says that Dapper "aims to offer some legitimate, valuable services and set up a means to respect copyright," others are expressing concern about the potential for copyright violations, especially by spam bloggers.

Either way though, both the cause for concern and the potential dangers are very, very real.

What is Dapper

When a user goes to create a new "Dapp", he or she first needs to provide a series of links. These links must be on the same domain and in similar formats (e.g., Google searches for different terms or different blog posts on a single site) for the service to work. Once the links have been defined, the user is then taken to a GUI where he or she picks out fields.

In a simple example where the user creates an RSS feed for a blog, the post title might be one field, perhaps called "post title", and the body would be a second, perhaps called "post body". Dapper, much like the social bookmarking service Clipmarks, is able to intelligently select blocks of text on a Web page, making it easy to ensure that the entire post body is selected and that extraneous information is omitted.

Once the fields have been selected, the user can then either create groups based upon those fields or simply save the Dapp for future use. Once the Dapp has been saved, the user can then use it to create raw XML data, an RSS feed, a Google Gadget or any number of other output files that can be easily used in other services.
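To make the idea concrete, here is a minimal sketch of what any field-based scraper does in principle: pick out named fields from a page and emit them as XML. This is not Dapper's actual code, which is not public. The class names ("entry-title", "entry-content"), the field names, the sample page and the function names are all assumptions made up for illustration, and only the Python standard library is used.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical mapping: output field name -> CSS class that marks it on the page.
FIELD_CLASSES = {"post_title": "entry-title", "post_body": "entry-content"}

# Void elements never get a closing tag, so they are ignored when tracking depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class FieldExtractor(HTMLParser):
    """Collects the text inside elements whose class matches a known field."""

    def __init__(self, field_classes):
        super().__init__()
        self.class_to_field = {cls: name for name, cls in field_classes.items()}
        self.fields = {name: "" for name in field_classes}
        self.current = None  # field currently being captured, if any
        self.depth = 0       # nesting depth inside the captured element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.current is not None:
            self.depth += 1
            return
        for cls in (dict(attrs).get("class") or "").split():
            if cls in self.class_to_field:
                self.current = self.class_to_field[cls]
                self.depth = 1
                return

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.current is None:
            return
        self.depth -= 1
        if self.depth == 0:
            self.current = None  # left the captured element

    def handle_data(self, data):
        if self.current is not None:
            self.fields[self.current] += data


def page_to_xml(html):
    """Extract the configured fields from one page and return them as an XML string."""
    parser = FieldExtractor(FIELD_CLASSES)
    parser.feed(html)
    parser.close()
    item = ET.Element("item")
    for name, text in parser.fields.items():
        # Collapse runs of whitespace so the output is a clean single line.
        ET.SubElement(item, name).text = " ".join(text.split())
    return ET.tostring(item, encoding="unicode")


# Made-up sample page: the sidebar and footer play the "extraneous information".
SAMPLE_PAGE = """
<html><body>
  <div class="sidebar">Unrelated links</div>
  <h1 class="entry-title">Hello World</h1>
  <div class="entry-content"><p>First paragraph.</p> <p>Second one.</p></div>
  <div class="footer">Copyright notice</div>
</body></html>
"""

if __name__ == "__main__":
    print(page_to_xml(SAMPLE_PAGE))
```

The point of the sketch is how little is involved once the fields are known. What Dapper adds is a point-and-click interface for choosing those fields, so that the user never has to write or read anything like the above.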

If you are interested in viewing a demo of Dapper, you can do so at this link.

There is little doubt that Dapper is an impressive service. It has taken the black art of scraping and made it into a simple, easy-to-use application that just about anyone can pick up. Though it might take a few tries to create a working Dapp, and certainly spending some time reading up on the service is required, most will find it easy to use, especially when compared to the alternatives.

However, it's this ease of use that has so many worried. Though scrapers have been around for many years, they have been either difficult to use or expensive. Dapper's power, when combined with its price tag and sheer ease of use, has many wondering whether it might be ushering in not a new age for the Web, but a new age for scrapers seeking to abuse others' hard work.

Cause for Concern

Being easy to use or free is not necessarily a problem in and of itself. However, in the rush to enable users to make an API for any site, Dapper's makers forget that many sites don't have one, or restrict access to their APIs, for very good reasons. RSS scraping is perhaps the biggest copyright issue bloggers face. It enables a plagiarist or spammer to steal not only all of the content on the blog right then, but also all of the content that will be posted in the future. This is a huge concern for many bloggers, especially those concerned about performing well in the search engines.

This has prompted many blogs to disable their RSS feeds, truncate them or move them to a feed monitoring service such as Feedburner. However, if users can simply create their own RSS feeds with ease, these protections are circumvented and Webmasters lose control over their content.

Even setting potential copyright abuse issues aside, Dapper creates potential problems for Webmasters. It bypasses the usual metrics that site owners have. A user who reads a site, or large portions of it, through a Dapp will not be counted in either the feed statistics or, depending on how Dapper is set up, even in the site's logs. All the while, the site is spending precious resources to feed the Dapp, taking money out of the Webmaster's pocket.

This combination of greater expense, less traffic and less accurate metrics can be dangerous to Webmasters who are working to get accurate traffic counts, visitor feedback or revenue.

Worse still, Dapp users also bypass any ads or other monetization tools that might be included in the site or the original RSS feed. This has a direct impact on sites trying to either turn a profit or, like this one, recoup some of the costs of hosting.

Despite this, it's the copyright concerns that reign supreme. Though screen scraping is not necessarily an evil technology, it is the sinister uses that have gotten the most attention and, sadly, seem to be the most common, especially with regard to blogs.

Even if the makers of Dapper are aiming to add copyright protection