Dapper: The Scraper for the Common Man

Sometimes, especially with Web 2.0 companies, jargon can get a little bit out of hand. When someone says that a service allows you to "build an API for any website", it can be a bit difficult to understand what that really means.

However, put simply, Dapper is a scraper. Nothing more. It allows you to scrape content from a Web page and convert it into an XML document that can be easily used at another location. Though you won't find the words "scrape" or "scraper" anywhere on its site, that is exactly what it does.

What separates Dapper from other scrapers, both legitimate and illegitimate, is that it is both free and easy to use. In short, it makes setting up a scraper simple enough for the everyday Internet user. While one has never needed to be a geek to scrape RSS feeds, now the technologically impaired can scrape content from any site, even those that don't publish RSS feeds.

Though the TechCrunch profile of the service says that Dapper "aims to offer some legitimate, valuable services and set up a means to respect copyright," others are expressing concern about the potential for copyright violations, especially by spam bloggers.

Either way though, both the cause for concern and the potential dangers are very, very real.

What is Dapper

When a user goes to create a new "Dapp", he or she first needs to provide a series of links. These links must be on the same domain and in similar formats (e.g., Google searches for different terms or different blog posts on a single site) for the service to work. Once the links have been defined, the user is then taken to a GUI where they pick out fields.

In a simple example where the user creates their own RSS feed for a blog, the post title might be one field, perhaps called "post title", and the body would be a second, perhaps called "post body". Dapper, much like the social bookmarking service Clipmarks, is able to intelligently select blocks of text on a Web page, making it easy to ensure that the entire post body is selected and that extraneous information is omitted.

Once the fields have been selected, the user can then either create groups based upon those fields or simply save the Dapp for future use. Once the Dapp has been saved, they can use it to create raw XML data, an RSS feed, a Google Gadget or any number of other output formats that can be easily used in other services.
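To make the idea of fields concrete, here is a minimal Python sketch of what a field-based scraper does in general. It is not Dapper's implementation, which is not public. The class names "entry-title" and "entry-content", the field names and the sample page are all assumptions made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical field definitions: field name -> CSS class that marks it.
FIELDS = {"post_title": "entry-title", "post_body": "entry-content"}

# Elements that never have a closing tag, so they are not tracked.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class FieldExtractor(HTMLParser):
    """Collects the text found inside elements that match a field."""

    def __init__(self, fields):
        super().__init__()
        self.class_to_field = {css: name for name, css in fields.items()}
        self.values = {name: [] for name in fields}
        self.open_fields = []  # one entry per open element: field name or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        match = next(
            (self.class_to_field[c] for c in classes if c in self.class_to_field),
            None,
        )
        self.open_fields.append(match)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.open_fields:
            self.open_fields.pop()

    def handle_data(self, data):
        # Text belongs to the innermost enclosing field, if there is one.
        for field in reversed(self.open_fields):
            if field is not None:
                self.values[field].append(data)
                break


def scrape_to_xml(html, fields=FIELDS):
    """Extract the configured fields from a page and return them as XML."""
    parser = FieldExtractor(fields)
    parser.feed(html)
    root = ET.Element("item")
    for name, chunks in parser.values.items():
        # Join the text chunks and collapse runs of whitespace.
        ET.SubElement(root, name).text = " ".join(" ".join(chunks).split())
    return ET.tostring(root, encoding="unicode")


SAMPLE_PAGE = """
<html><body>
  <div id="sidebar">Archives and ads</div>
  <h1 class="entry-title">Hello World</h1>
  <div class="entry-content"><p>First paragraph.</p><p>Second one.</p></div>
</body></html>
"""

if __name__ == "__main__":
    print(scrape_to_xml(SAMPLE_PAGE))
```

The sidebar text is dropped because it sits outside both fields, which is the "extraneous information is omitted" behavior described above. A real tool would also need to fetch the pages and cope with messy markup, neither of which is attempted here.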

If you are interested in viewing a demo of Dapper, you can do so at this link.

There is little doubt that Dapper is an impressive service. It has taken the black art of scraping and made it into a simple, easy-to-use application that just about anyone can pick up. Though it might take a few tries to create a working Dapp, and certainly spending some time reading up on the service is required, most will find it easy to use, especially when compared to the alternatives.

However, it's this ease of use that has so many worried. Though scrapers have been around for many years, they have been either difficult to use or expensive. Dapper's power, when combined with its price tag and sheer ease of use, has many wondering whether it might be ushering in not a new age for the Web, but a new age for scrapers seeking to abuse others' hard work.

Cause for Concern

While being easy to use or free is not necessarily a problem in and of itself, in the rush to enable users to make an API for any site, Dapper's makers forget that many sites don't have one, or restrict access to their APIs, for very good reasons. RSS scraping is perhaps the biggest copyright issue bloggers face. It enables a plagiarist or spammer to steal not only all of the content on the blog right then, but also all of the content that will be posted in the future. This is a huge concern for many bloggers, especially those concerned about performing well in the search engines.

This has prompted many blogs to either disable their RSS feeds, truncate them or move them to a feed monitoring service such as Feedburner. However, if users can simply create their own RSS feeds with ease, these protections are circumvented and Webmasters lose control over their content.
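For readers unfamiliar with feed truncation, the following Python sketch shows roughly what it involves. It assumes a plain RSS 2.0 feed with item descriptions. The sample feed, the function name and the 40-word default are made up for illustration and do not reflect how any particular blogging platform does it.

```python
import xml.etree.ElementTree as ET


def truncate_feed(rss_xml, max_words=40):
    """Return the feed with each item description cut to max_words words."""
    root = ET.fromstring(rss_xml)
    # Only item descriptions are shortened; the channel description is kept.
    for desc in root.findall("./channel/item/description"):
        words = (desc.text or "").split()
        if len(words) > max_words:
            desc.text = " ".join(words[:max_words]) + " [...]"
    return ET.tostring(root, encoding="unicode")


# Hypothetical feed used only to exercise the function.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example Blog</title>
  <description>A blog used only for illustration</description>
  <item>
    <title>First post</title>
    <description>one two three four five six</description>
  </item>
</channel></rss>"""

if __name__ == "__main__":
    print(truncate_feed(SAMPLE_FEED, max_words=3))
```

Note that this only shortens what the feed exposes. The full post is still on the page itself, which is exactly why a scraper that reads the page rather than the feed sidesteps the protection.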

Even with potential copyright abuse issues aside, Dapper creates potential problems for Webmasters. It bypasses the usual metrics that site owners have. A user who reads a site, or large portions of it, through a Dapp will not be counted in either the feed statistics or, depending on how Dapper is set up, even in the site's logs. All the while, the site is spending precious resources to feed the Dapp, taking money out of the Webmaster's pocket.

This combination of greater expense, less traffic and less accurate metrics can be dangerous to Webmasters who are working to get accurate traffic counts, visitor feedback or revenue.

Worse still, Dapp users also bypass any ads or other monetization tools that might be included in the site or the original RSS feed. This has a direct impact on sites trying to either turn a profit or, like this one, recoup some of the costs of hosting.

Despite this, it's the copyright concerns that reign supreme. Though screen scraping is not necessarily an evil technology, it is the sinister uses that have gotten the most attention and, sadly, seem to be the most common, especially with regard to blogs.

Even if the makers of Dapper are aiming to add copyright protection at a later date, the service is fully functional today. Though the FAQ states that they will "comply with any verified request by the lawful owner of the content to cease using his content," there is no opt-out procedure, no DMCA information on the United States Copyright Office Web site, no information on how to prevent Dapper from accessing your site and nothing but a contact page to get in touch with the makers of the service.

(Note: An email sent to the makers of Dapper on the 22nd has, as of yet, gone unanswered)

In addition to creating a potential copyright nightmare for Webmasters, the site seems to be setting itself up for a lawsuit. Besides not being DMCA Safe Harbor compliant (PDF), thus opening it up to copyright infringement lawsuits directly, the service seems to be vulnerable to a lawsuit under the MGM v. Grokster case, which found that service providers can be sued for infringement conducted by their users if they fail an "inducement" test. Sadly for Dapper, simply saying that it is the user's responsibility is not adequate to pass such a test, as Grokster found out. The failure to offer filtering technology and the encouragement to create APIs for "any" site are both likely strikes against Dapper in that regard.

To make matters more grim, copyright is not the only issue scrapers have to worry about. As one pair of lawyers put it, there are at least four different legal theories that make scraping illegal, including the Computer Fraud and Abuse Act, trespass to chattels and breach of contract. All in all, copyright is practically the least of Dapper's problems.

When it's all said and done, there is a lot of room for concern, not just on the part of Webmasters that might be affected by Dapper or its users, but also its makers. These intellectual property and other legal issues could easily sink the entire project.

Conclusions

It is obvious that a lot of time and effort went into creating Dapper. It's a very powerful, easy-to-use service that opens up interesting possibilities. I would hate to see the service used for ill and I would hate even more to see all of the hard work that went into it lost because of intellectual property issues.

However, in its current incarnation, it seems likely that Dapper is going to encounter significant resistance on the IP front. There is little, if any, protection or regard for intellectual property under the current system and, once bloggers find out that their content is being syndicated without their permission by the service, many are likely to start raising a fuss.

Even though Dapper has gotten rave reviews in the Web 2.0 community, it seems likely that traditional bloggers and other Web site owners will have serious objections to it. Those people, sadly, most likely have never heard of Dapper at this point.

With that being said, it is a service everyone needs to make note of. The one thing that is certain is that it will be in the news again. The only question is what light it will be under.

15 COMMENTS

  1. Hi Jonathan,
    I actually sent you an email in an attempt to answer all of your concerns, even more so, explaining how we in fact will be providing a new monetizing route to content owners and certainly not infringing their rights. It seems that you haven’t got it, which probably caused the above post.
    I’m re-sending you my thorough reply again, and I’ll wait 24 hours for a ping from you that you got it. In case it fails again, I’ll just post it here to make sure we get the message across.
    One last point. As we put up all of the necessary mechanisms for opting out smoothly etc. in the meantime anyone can just send us a message through the contact page of our site, http://www.dappit.com.
    In case you think you didn’t get a response fast enough, here’s my private email address that you can also send requests to: eran@dappit.com
    Hopefully my email will answer many of the puzzles that you raise and explain how Dapper has no intention of infringing anyone’s rights, but rather empowering the content owners.
    Best,
    Eran

  3. Regardless of how good or bad Dapper is, I think it has shed light on the fact that the information currently available on the web just can’t cope with the demand for more processable and transformable information. If we’ve been going through the age where people can choose how they publish their stuff (using blogs, that is), then these Dapper days would be the days where people want to be able to choose how they consume the information. The term "semantic web" has been around for too darn long to be ignored; I simply cannot blame Dapper for providing an enjoyable tool to make the internet more meaningful. 🙂

  4. This has been done before, back in 1999, by a company called Octopus.com http://news.com.com/2100-1023-236024.html It was a company before its time, way before its time, and I’m surprised it took someone this long to even think about trying it again.

    I loved Octopus.com and I love Dapper now. Unfortunately it wasn’t legal then and it isn’t legal now. What they are doing is like reprinting an entire library of books and selling them and then saying they will gladly remove an author’s book or give them a share of the profit if they ask.

  6. I think most people are missing the point. Everyone is talking about plagiarism, but what about creating really useful applications from multiple sources that are still not syndicated, like an event calendar for a given area? This kind of thing is really easy to do with Dapper. What’s more, if a spammer is really interested in stealing your content, he will do it regardless of the tools (nowadays it is very easy to create a good spider).

  7. Alveo
    I blog frequently and I seriously thank you for your information. Your article has really piqued my interest. I’m going to take a note of your website and keep checking for new information about once per week. I opted in for your Feed too.
