It’s been a busy month in the world of scraping. One previously covered screen scraping service released a new set of tools targeted at Webmasters and another major player entered the market with their own service.
For content owners, it is both a reason to cheer and a reason for concern. There have never been more ways to make content available to the public and get information into the hands of interested parties. On the other hand, it has never been more difficult to stop unwanted and illegal use of your content.
The questions become clear: How do these services work? How can we, as content producers, take advantage of them? And how can we prevent content theft through them?
The answers are not as simple as they might seem.
Dapper Caters to Webmasters
In stark contrast to its initial release, which did not even offer an opt-out procedure for Webmasters, Dapper has created a whole new set of tools that enable Webmasters to not only opt out of the service, but also license their content in ways that are suitable to them.
Currently, the site offers the following licensing options:
- Exchange for Link Back (Commercial use may be prohibited)
- Public Domain
- Explicit Permission Required (Opt in only)
- No Access (Opt out complete)
Dapper also says it is working on services that will license content both for direct payment and for shared revenue.
Noticeably absent are the Creative Commons Licenses, especially the no-derivatives and share-alike ones, as well as any way to specify which content can be used (for example, images but not text, or vice versa). Also, Dapper applies its rules uniformly across an entire domain, making it currently impossible to distinguish between different sites hosted on the same domain. This could be a major roadblock for sites on free hosting services.
But the biggest problem is that Dapper, at this time, requires you to register an account to select a license. They have removed the old opt-out page, the one that did not require an account, and now handle all licensing through this process. (Note: All sites opted out prior to this change are still excluded.)
Though account registration is very simple, the default setting for sites that do not select a license is basically “public domain”, allowing all use of the content. Also, according to Dapper’s FAQ, they do not provide any enforcement of the licenses at this time. They do, however, say they plan to offer that in the future.
In many ways, this is actually a step backwards for Dapper. Opting out is now more difficult and, though Webmasters who register have more control, those who don’t still have none. An ideal solution would be to set all new sites to “Explicit Permission” and have the user wanting to create the Dapp get permission from the site before doing so. This would only need to be done once per site. Also, Dapper needs to be able to distinguish between different sites on the same domain, lest millions of relatively independent sites get lumped together (think Myspace).
Since lawyers seem to agree that there is no implied license with RSS, and definitely not one with HTML, Dapper is skating on thin ice and needs to rethink how it approaches unregistered sites.
Yahoo! Launches Pipes
This past Thursday, Yahoo! released its own scraping technology, Yahoo! Pipes. Its makers describe it as follows:
Pipes is a hosted service that lets you remix feeds and create new data mashups in a visual programming environment. The name of the service pays tribute to Unix pipes, which let programmers do astonishingly clever things by making it easy to chain simple utilities together on the command line.
In many ways it is like Dapper, allowing users to pull from various data sources, remix the information and then put it out in a format that is human and/or machine-readable. However, Pipes is currently limited only to RSS feeds and does not scrape from HTML. Also, it offers three very easy ways to opt out of the service including:
- Configure your Web server to block Yahoo! Pipes
- Add a meta tag to your feed
- Email pipes-optout at yahoo-inc.com
The first two methods are pretty much useless on sites that host their feeds with FeedBurner or use free services that do not allow them to directly manipulate their files or server settings. However, the last one is good for all feeds and can distinguish between feeds hosted on the same domain.
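For those who can edit their server configuration, the first method might look something like the sketch below for Apache. Note that the "Yahoo Pipes" User-Agent string here is an assumption for illustration; check your access logs or Yahoo!'s own documentation for the exact string the Pipes fetcher actually sends before deploying anything like this.

```apache
# Hypothetical .htaccess sketch: deny requests whose User-Agent
# contains "Yahoo Pipes" (the string is an assumption -- verify it
# against your server logs first).
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "Yahoo\ Pipes" [NC]
RewriteRule ^ - [F]
```

The `[F]` flag returns a 403 Forbidden response, which blocks the fetcher without affecting ordinary visitors or search engine crawlers.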
That being said, Pipes does still, by default, allow use of RSS feeds that have not been opted out. There is no licensing assistance for Webmasters at this time and Yahoo! admits to ignoring robots.txt files as Pipes “is not a web crawler”. Pipes can, very easily, be used to create spam blogs and other scraped sites in a way that is difficult for Webmasters to detect and stop.
Still, since Pipes is limited to content that has already been syndicated in one format or another, the risk of theft is lower. Most of its splogger-friendly features are already available through Technorati watchlists and Google Blog Search feeds.
Despite that, I would still like to see Yahoo! show more respect for RSS feeds that are out there. It claims to not be a Web crawler but at the same time feels free to scrape anything that isn’t deliberately opted out, similar to what a Web crawler does. It would be nice to at least see Yahoo! follow search engine conventions when dealing with Pipes, especially considering the potential for infringement and harm from that infringement is much higher with Pipes than it is with their traditional search product.
What is needed, more than anything, is a convention in this area. Just as robots.txt established a convention for indexing sites in search engines, a similar standard needs to be created for the scraping and reuse of content by other sites. Such a convention would have to work in both HTML pages and RSS feeds.
Though the Creative Commons Organization has attempted to create such a standard, so far the major players in this area are ignoring their licenses. Something new needs to be created, both for the sake of Webmasters and the sake of sites such as Dapper and Pipes that wish to reuse content.
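To make the idea concrete, such a convention might look something like the sketch below. This is purely hypothetical; no such standard exists today, and every directive name here is invented solely to illustrate what a robots.txt-style file for content reuse could express.

```
# Hypothetical "reuse.txt" -- an invented illustration, not a real standard.
# A scraping service would fetch this before remixing any content.
User-agent: *
Reuse: allow
Attribution: link-back-required
Commercial-use: disallow
Derivatives: allow
```

Like robots.txt, this would only work if the major scraping services agreed to honor it, which is exactly the cooperation that is missing today.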
Simply put, since the law takes the strictest possible approach to these types of reuse, mashup and blending sites need a way to make intentions clear, and to be sure those intentions are followed, if they are to survive over the long term.
Otherwise, we’re just waiting for the first case to go to court and then things will get very ugly.