Excerpts, Scraping and Fair Use

Jonathan Bailey, March 3, 2009


A recent article in the New York Times entitled “Copyright Challenge for Sites That Excerpt” has caused a controversy on the Web. The article, by Brian Stelter, talks about the frustrations of copyright holders, specifically large media companies, with sites such as the Huffington Post and Silicon Valley Insider that routinely excerpt and repeat information from them with a link back.

The article quotes Joshua Benton, the director of the Nieman Journalism Lab at Harvard, as saying, “They (the news organizations) think they need to defend their turf more aggressively.”

This has become a very hot topic in journalism circles in the last few months. Last month, Scott Baradell, a former journalist and executive at Belo Media, wrote that newspapers are running out of time to deal with their “content theft” problem. It has also taken off in blogging and social media circles: yesterday, following the NY Times article, Allen Stern of Center Networks wrote an article about why scraping will only get worse.

These are among dozens of op-eds and blog posts on this topic. However, the question that has been raised by all of them is simple enough: when does copying, even excerpted and attributed, become something that harms content creators and what, if anything, should they do about it?

Answering it, however, is far more difficult than asking it.

The State of the Scraping Web

In Stern’s follow-up, he wrote that there are two types of scrapers: “Those who scrape full rss feeds and those who scrape just enough to keep you on their site and keep the conversation there as well.” Though he originally thought the first type was the bad kind and the second was tolerable, he now feels that both kinds “suck”.

If I have noticed any shift in scraping over the past three years, it has been that there are fewer and fewer “whole feed” scrapers to be found. Though there are still many on the Web, there has been a huge shift to partial feed scraping. When I look through my FairShare feed for this site and other sites I monitor, I see almost no blogs scraping more than a few dozen words.

I have been timid about going after these sites, though many are clearly just spam blogs. By using such short excerpts, even though most violate my CC license or the licenses of my friends, they are giving themselves a decent fair use argument, and they aren’t hurting either of us in any meaningful way.

However, there has been a new group of sites that have begun to push the boundaries between scrapers and social news/networking sites. As Stern pointed out, these sites scrape content and then encourage users to comment, vote, socialize and generally stay on their site rather than visiting the original. I instantly think of SocialMedian and Fav.or.it when I ponder these sites.

Though the two sites above scrape significant portions of the post and even hotlink images, two things that have raised a lot of concern among many bloggers, there has been a new crop of scrapers that take only a small part of the content and add social news features such as voting and commenting. Though these sites may look like Digg or Reddit, it is clear that the content is not user-submitted and the human element only comes in after the fact.

In short, what is happening is scrapers have been shifting their focus, trying to improve their legitimacy both in the eyes of copyright law and in terms of users. With that, some have turned their attention away from blind attempts to game the search engines toward keeping users and building repeat traffic.

This has been a constant headache for me in the past few months and education/outreach attempts have been largely unsuccessful at producing any real change.

The Human Element

The area that Stern only touched on, which was actually the crux of the original New York Times column, is the human aspect of this. Many blogs have built quite a reputation on gathering news from other sources, excerpting and linking to it. The Huffington Post, by most accounts, is the poster child for this kind of blogging and has turned it into a very successful business, one with over 20 million pageviews per month according to the article.

Though the Huffington Post has begun to do its own reporting, much of its content is still just excerpts and links to other articles. The question is whether the links it provides back to the content sources are adequate “compensation” for the content used. Many in the news industry feel that these sites are “leeches”, doing nothing to pay for the reporting and writing of an article but building entire businesses around their use of others’ content, even though the only value typically added is the organization and any additional commentary.

Even though most still feel excerpting and linking is, on the whole, a positive force, many news organizations have begun to recoil and are re-raising paywalls and shying away from their open attitude on free content. It is clear that many have reached the conclusion that “free” is not a viable long-term strategy for content that costs money to produce for the Web.

The question, though, becomes: what about the rest of us? Whether we’re blogging for fun, to promote a business or to earn ad revenue, we still spend hours writing posts and creating new works. Excerpting can affect us too. Good excerpting can drive tons of traffic and visitors; bad excerpting takes them away. Furthermore, many bloggers, including myself with the “3 Count” series, use excerpting alongside original commentary. We certainly don’t want to harm or detract from those we pull from.

But where is the line drawn and how can anyone tell what side of the fence they are on?

Finding Guidance

As the New York Times article discussed, fair use guidelines in the U.S. are far too ambiguous to be of much use in this area. There are no “hard line” rules with fair use and everything is decided on a case-by-case basis.

While that maximizes the flexibility in the law, it also causes headaches when trying to set up simple rules. Attributor, a company I have consulted with, took a stab at setting some “bright line” rules on their blog, suggesting the following three rules:

Excerpt must contain a link.
Excerpt must use less than 50% of the original content.
Excerpt must also use less than 100 words.

Though they admit that the exact numbers are up for debate, they feel that a combination of percentage and word count can be used to compile a good standard.
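To make the combination of percentage and word count concrete, here is a rough sketch of how the three rules above could be checked mechanically. It is only an illustration under stated assumptions: the function name check_excerpt, its parameters and the whitespace-based word counting are hypothetical and do not come from Attributor, and the 50% and 100-word thresholds are simply the debatable numbers quoted above.

```python
import re


def check_excerpt(excerpt, original, has_link, max_fraction=0.5, max_words=100):
    """Check an excerpt against the three suggested 'bright line' rules.

    Hypothetical helper for illustration only. Words are counted as
    whitespace-separated tokens, which is an assumption, not part of
    the original proposal.
    """
    excerpt_words = len(re.findall(r"\S+", excerpt))
    original_words = len(re.findall(r"\S+", original))

    # Treat an empty original as fully copied to avoid dividing by zero.
    fraction = excerpt_words / original_words if original_words else 1.0

    results = {
        "has_link": has_link,                            # rule 1: must link back
        "under_fraction": fraction < max_fraction,       # rule 2: less than 50%
        "under_word_limit": excerpt_words < max_words,   # rule 3: less than 100 words
    }
    # An excerpt passes only if all three rules hold at once.
    results["passes"] = all(results.values())
    return results


if __name__ == "__main__":
    article = "word " * 400
    quote = "word " * 60
    print(check_excerpt(quote, article, has_link=True))
```

Note that a check like this says nothing about whether a use is actually fair use; it only shows how a percentage cap and a word cap work together, since a short quote from a very short post can fail the percentage rule while a long quote from a long article can fail the word limit.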

However, Silicon Valley Insider, one of the blogs mentioned in the New York Times article for its heavy use of excerpting, has taken a more “golden rule” approach to the problem and defined their excerpting policy as simply, “We excerpt others the way we hope others will excerpt us.” Though they open the door for excerpting well outside the boundaries of what Attributor laid out in their rules, they did say that they would work with content creators that feel the blog took too much or did not provide clear enough attribution.

What has become clear is that “fair” truly is in the eye of the beholder, something I strongly agree with in the Silicon Valley Insider post, and that what one blogger considers right another will consider wrong. For example, even though I’ve done everything I can to make the “3 Count” column as fair to the original reporters as possible, including large links, limited quoting and adding original commentary, many will likely think it to be bad form.

As true as it is, it doesn’t help bloggers that seek to reuse content while doing the right thing, nor does it help those who’ve had their content used on borderline sites.

Excerpting is one giant gray area and it is getting uglier by the minute.

My Personal Experience

Recently, my article about CC0 took off, landing on Slashdot and receiving a mention on Ars Technica.

As a result of this, my article has been quoted and excerpted countless times. My FairShare feed has been lit up with quotes from the article over the past few days as it has appeared on many blogs of varying sizes. Nearly all of the uses have involved short quotes, almost always with proper links.

Though I have seen some traffic from those mentions, it certainly has only been a small percentage of visitors to those sites. It has been more of a trickle than a flood from these blogs.

However, this isn’t to say that I think those sites “ripped me off” or did anything wrong. On the contrary, I’m very glad that they liked the article enough to quote it and link to it. I’m very happy with how that article did and was caught quite off guard by the attention it received.

On the sites that I visited, it seemed that everyone was doing what was both right and legal. Even if we discarded my CC license and assumed I wanted to stop this kind of behavior, fair use would probably have prevented me from taking any action. Even though I only received a small number of visitors, they are people who never would have found my site otherwise and the fact many visitors didn’t click through is not the fault of those that used the content. The visitors, for whatever reason, were not interested.

To me, proper excerpting is not about bright lines, but about symbiosis: making sure that the creators of the original work gain from your use as much as possible. I don’t think I am ready to accept the Silicon Valley Insider’s “golden rule” approach to the process, since laws and feelings other than your own still have to be considered, but I’m not in favor of bright line rules such as Attributor’s either. The latter can be useful as loose guidelines, the type of thing editors would send to their reporters, but not as an absolute rule.

The reason excerpting doesn’t work as well as we would like is that visitors routinely don’t follow through to the source. Even the best use of an excerpt will only pass along a small percentage of visitors to the original story. Though there are exceptions to that rule, such as sites like Digg and Reddit (these are sites where the “easiest” path is to the original site), most of the time visitors don’t click through. It’s that simple.

As surfers, we are all guilty of that. We’ve all read a post with an excerpt and moved on without checking out the source. We probably do it hundreds of times a day without realizing it. First off, checking every source is infeasible from a time standpoint. Second, sometimes the excerpt is all we need or want.

That’s understandable, and it explains why even proper excerpting can feel like a very unbalanced relationship.

Bottom Line

If you’re a blogger or are otherwise posting content on the Web, you are automatically involved in three different ways: first, as someone who is going to have their work turned into excerpts; second, as someone who is going to use other people’s work, either as quotes or as research; and third, as a visitor reading excerpted content and deciding whether or not to click through.

Though bright line rules can provide some guidance, the only solution that is going to work is respect. Whenever we do any of these things, we have to show respect to the others involved. When using other people’s content, we have to work to attribute correctly and ensure they gain from the use. We have to be respectful of others’ rights to use our content within some limited capacity. And, as we surf the Web and share links, we have to work to ensure we find and reference original sources, or at least sites with original commentary.

Are mistakes going to be made? Yes. Are some people going to go too far and need to be called on it? Yes. But if we keep it in our minds that the people who do the research, write the articles and create the content deserve the lion’s share of the reward and let that guide our actions, for the most part we’ll be ok.

Though big content creators like newspapers and magazines are going to have to hammer out business strategies to keep their doors open as they transition to the Web, bloggers and smaller content creators have a different, albeit similar, set of concerns.

In the end, we all have to work together. The difference between the good guys and the spammers is that the latter don’t care if the creator gets any of their due. If they work within the law, it is only because it is easier, not because it is the right thing to do. Good neighbors on the Web consider these issues and, though they might reach conflicting conclusions, at least try to offer support back to creators.

We should focus at least as much on the true bad guys of the Web. They are the ones doing by far the most harm and, while this conversation is important to have for many reasons, it can’t become a distraction from them.

There are plenty of real scumbags on the Web to fight.
