Using Creative Commons to Stop Scraping

Jonathan Bailey, June 5, 2007

Many sites, including this one, have expressed concerns that CC licenses may be encouraging or enabling scraping.

The problem seems to be straightforward. If a blog licenses all of its content under a CC license, then a scraper that follows the terms of that license is just as protected as a human copying one or two works. This may be within the letter of the license, but it violates the spirit of Creative Commons.

However, after talking with Mike Linksvayer, the Vice President of Creative Commons, I’m relieved to say that is not the case. CC licenses have several built-in mechanisms that can prevent such abuse.

In fact, when one looks at the future of RSS, it is quite possible that using a CC license might provide better protection than using no license at all.

Against the Spirit: A Crisis with the Commons?

Whether scrapers specifically target CC-licensed material is up for debate. What is clear is that, when they do, it is often a source of frustration.

People choose CC licenses because they want to share their work with others. They want to participate in a cultural revolution and give their ideas new wings. They do not, generally, want to see their entire site mirrored elsewhere, surrounded by AdSense ads and depriving them of traffic.

Ideally, a CC license is supposed to be symbiotic. The licensor gives up certain rights to their work and the licensee, in exchange for use of the work, makes certain the original author gets due credit and is rewarded for his or her effort. Spam bloggers, however, approach the CC license in bad faith, taking as much as they can while giving as little as possible back.

This has prompted many CC license users to either drop or alter their license. It has become common for sites that are being scraped to change their licenses to “non-commercial”, stop using CC licenses or even shut down their sites altogether.

However, though some spam blogs do seem to follow the terms and conditions of the Creative Commons licenses, even if by accident, the vast majority do not. In fact, even enabling commercial use of your work is not an open invitation to be scraped.

As it turns out, CC licenses have built-in mechanisms that can be used to fight that kind of abuse.

Where Computers Fear to Tread

For the use of a CC licensed work to be valid, according to Linksvayer, the following terms must be met among others:

The work must be attributed and it must provide a link back to the copyright holder.
If the license is non-commercial, then the work must be used accordingly.
If a license has a share-alike term attached to it, then the copied work must express the same license.
All CC licensed material must state that it is licensed as such, usually with a statement that says “This work is licensed under a Creative Commons License”. Failure to do so puts the reuse in violation.
Finally, with Creative Commons, the licensor has the right to request removal of their name from any reused content. Failure to comply puts the reuse in violation of the license.
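
The five terms above can be read as a checklist. What follows is only a rough sketch of that checklist in Python, not a legal test: the class names, field names and the violations function are all hypothetical and were invented for this illustration.

```python
from dataclasses import dataclass


@dataclass
class License:
    """Optional terms of a CC license (hypothetical model)."""
    non_commercial: bool = False
    share_alike: bool = False


@dataclass
class Reuse:
    """Facts about one republished copy (all field names are hypothetical)."""
    attributes_author: bool
    links_back: bool
    is_commercial: bool
    states_cc_license: bool
    carries_same_license: bool
    removed_name_on_request: bool = True  # True when no request was ever made


def violations(reuse, license):
    """Return the list of terms from the checklist that the reuse fails."""
    problems = []
    if not (reuse.attributes_author and reuse.links_back):
        problems.append("missing attribution or link back")
    if license.non_commercial and reuse.is_commercial:
        problems.append("commercial use under a non-commercial license")
    if license.share_alike and not reuse.carries_same_license:
        problems.append("share-alike license not carried over")
    if not reuse.states_cc_license:
        problems.append("no statement that the work is CC licensed")
    if not reuse.removed_name_on_request:
        problems.append("name not removed after a request")
    return problems


# A spam blog that attributes and links back, scraped from a site whose
# license allows commercial use and has no share-alike term.
spam_blog = Reuse(
    attributes_author=True,
    links_back=True,
    is_commercial=True,
    states_cc_license=False,
    carries_same_license=False,
)
print(violations(spam_blog, License()))
```

Even with the most permissive license in this model, the example reuse still fails one term, because it never states that the work is CC licensed.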

The problem with all of this is that it is almost impossible for an automated scraping system to comply with all of these elements.

Though some spam bloggers do attribute and link back, most do not. All spam blogging, at least in theory, is a violation of the non-commercial license and, since no spam blogs I have seen carry over CC information, they are in violation of the attribution and share-alike attributes of the Creative Commons License.

Even if a scraper manages to comply with the first four mechanisms above, it is unlikely that its operator would, when asked, remove the name from any work they reused. Spammers, seeking to automate their operations, are unlikely to edit their spam blogs by hand to appease one copyright holder.

The result is that virtually all automated scraping and spam blogging is a violation of the Creative Commons License, regardless of what license is used.

Technicalities and Human Error

Some of these attributes, however, are relatively unknown. Though most people understand what is and is not acceptable with the various CC licenses, many of the nuances of using CC licenses, such as the fourth mechanism, are little known or followed, even by humans seeking to play fair.

However, most copyright holders, often in the dark about the requirements themselves, do not hold human copiers to these standards. So long as they get the attribution and reuse that they envision, they typically do not raise any alarms if there isn’t a “This work is licensed under…” statement in the reused content.

The question is whether or not it is fair to hold scrapers to a higher standard than we generally hold other people. While it certainly is the right of the copyright holder to determine which misuses of their work they follow up on and deal with, many will, likely, feel uneasy about using largely unenforced technicalities against spam bloggers.

But even those who feel uneasy about enforcing those elements of the Creative Commons license may still benefit from applying one, especially to their feed. With some very difficult questions about copyright and RSS feeds unanswered, having a defined license on your feed could become critical.

Implied Licenses and RSS

Though most attorneys I know and have spoken with feel that there is no implied license to scrape and republish RSS feeds, the question has not yet come before a court and the outcome, as with all cases pushing new territory, is unpredictable at best.

However, if a court determines that republishing a scraped RSS feed is legal because posting the feed grants an implied license, attorney Denise Howell feels that any such implied license can be overridden by a defined one, such as a Creative Commons license.

This makes sense considering that an implied license is designed to operate when no actual agreement exists between the parties. If a specific license is posted, it overrides the implied one.

We see this already on the Web. The courts have found that posting a Web page to the Internet creates an implied license for it to be indexed and cached by the search engines. However, once you state your intention for the page not to be used in such a manner, whether through meta tags, robots.txt or manual opt-out, the implied license is dropped and the search engines, legitimate ones at least, have to comply with your requests.
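
To make the robots.txt case concrete, here is a minimal sketch of how a well-behaved crawler might check a site's stated wishes before indexing a page. It uses Python's standard urllib.robotparser module. The robots.txt content, the ExampleBot name, the example.com URLs and the may_index helper are all assumptions made up for this illustration.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt in which the site owner opts one
# directory out of indexing for every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_index(url, agent="ExampleBot"):
    """Return True if the site's robots.txt lets this crawler fetch the URL."""
    return parser.can_fetch(agent, url)


print(may_index("https://example.com/public/post.html"))
print(may_index("https://example.com/private/post.html"))
```

The first URL is allowed and the second is not. The point of the sketch is only that the owner's explicit statement, not the default assumption, decides what the crawler may do.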

Creative Commons is working on a means of integrating CC licenses into RSS feeds. Hopefully this issue will garner more attention as the legal issues mount, and a more final draft can be fleshed out.
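
As an illustration of what a license declared inside a feed can look like, the sketch below builds a tiny RSS 2.0 feed that carries a license URL in a creativeCommons:license element. The namespace URI is assumed here to be that of the existing creativeCommons RSS module, and this is not necessarily the approach Creative Commons will settle on. The build_feed helper, the feed details and the choice of license URL are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the creativeCommons RSS module.
CC_NS = "http://backend.userland.com/creativeCommonsRssModule"
ET.register_namespace("creativeCommons", CC_NS)


def build_feed(title, link, description, license_url):
    """Return an RSS 2.0 document whose channel declares a CC license."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    # The license applies to the whole channel in this sketch.
    ET.SubElement(channel, "{%s}license" % CC_NS).text = license_url
    return ET.tostring(rss, encoding="unicode")


feed_xml = build_feed(
    title="Example Blog",
    link="https://example.com/",
    description="A hypothetical feed used only for illustration.",
    license_url="https://creativecommons.org/licenses/by/3.0/",
)
print(feed_xml)
```

A reader or aggregator that understands the element can then find the license terms in the feed itself rather than guessing at an implied one.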

Conclusions

The bottom line is that Creative Commons does not encourage or permit blind RSS scraping and spam blogging. Though it might be useful for legitimate aggregation, Creative Commons provides a great deal of protection against scraping, much more than previously thought.

Whether or not these mechanisms prove useful in fighting scraping remains to be seen. However, there is no longer a reason to hold back on a DMCA notice or a copyright complaint just because your CC license allows commercial use and so seems to permit the scraping. Unless the scraper followed all of the requirements above, the use is still invalid.

Hopefully this will encourage the wider use of CC licenses, specifically the use of more liberal ones. I myself have removed the non-commercial requirement from my CC license as, like many others, my primary concern was commercial use by scrapers.

In the end, this is just another example of how Creative Commons, when used correctly, can work well for everyone and, in many cases, is good copyright policy.
