Excerpts, Scraping and Fair Use


A recent article in the New York Times entitled “Copyright Challenge for Sites That Excerpt” has caused a controversy on the Web. The article, by Brian Stelter, talks about the frustrations of copyright holders, specifically large media companies, with sites such as the Huffington Post and Silicon Valley Insider that routinely excerpt and repeat information from them with a link back.

The article quotes Joshua Benton, the director of the Nieman Journalism Lab at Harvard, as saying, “They (the news organizations) think they need to defend their turf more aggressively.”

This has become a very hot topic in journalism circles in the last few months. Last month, Scott Baradell, a former journalist and executive at Belo Media, wrote that newspapers are running out of time to deal with their “content theft” problem. It has also taken off in blogging and social media circles: yesterday, following the NY Times article, Allen Stern of Center Networks wrote an article about why scraping will only get worse.

These are among dozens of op-eds and blog posts on this topic. However, the question that has been raised by all of them is simple enough: when does copying, even excerpted and attributed, become something that harms content creators and what, if anything, should they do about it?

Answering it, however, is far more difficult than asking it.

The State of the Scraping Web

In Stern’s follow-up, he wrote that there are two types of scrapers: “Those who scrape full rss feeds and those who scrape just enough to keep you on their site and keep the conversation there as well.” Though he originally thought the first type was the bad kind and the second was tolerable, he now feels that both kinds “suck”.

If I have noticed any shift in scraping over the past three years, it has been that there are fewer and fewer “whole feed” scrapers to be found. Though there are still many on the Web, there has been a huge shift to partial feed scraping. When I look through my FairShare feed for this site and other sites I monitor, I see almost no blogs scraping more than a few dozen words.

I have been timid about going after these sites, though many are clearly just spam blogs. By using such short excerpts, even though most violate my CC license or the licenses of my friends, they are giving themselves a decent fair use argument and they aren’t hurting either of us in any meaningful way.

However, there has been a new group of sites that have begun to push the boundaries between scrapers and social news/networking sites. As Stern pointed out, these sites scrape content and then encourage users to comment, vote, socialize and generally stay on their site rather than visiting the original. I instantly think of SocialMedian and Fav.or.it when I ponder these sites.

Though the two sites above scrape significant portions of the post and even hotlink images, two things that have raised a lot of concern among many bloggers, there has been a new crop of scrapers that take only a small part of the content and add social news features such as voting and commenting. Though these sites may look like Digg or Reddit, it is clear that the content is not user-submitted and the human element only comes in after the fact.

In short, what is happening is scrapers have been shifting their focus, trying to improve their legitimacy both in the eyes of copyright law and in terms of users. With that, some have turned their attention away from blind attempts to game the search engines toward keeping users and building repeat traffic.

This has been a constant headache for me in the past few months and education/outreach attempts have been largely unsuccessful at producing any real change.

The Human Element

The area that Stern only touched on, which was actually the crux of the original New York Times column, is the human aspect of this. Many blogs have built quite a reputation on gathering news from other sources, excerpting and linking to it. Huffington Post, by most accounts, is the poster child for this kind of blogging and has turned it into a very successful business, one with over 20 million pageviews per month according to the article.

Though the Huffington Post has begun to do its own reporting, much of its content is still just excerpts and links to other articles. The question is whether the links it provides back to the content sources are adequate “compensation” for the content used. Many in the news industry feel that these sites are “leeches”, doing nothing to pay for the reporting and writing of an article but building entire businesses around their use of others’ content, even though the only value typically added is the organization and any additional commentary.

Even though most still feel excerpting and linking is, on the whole, a positive force, many news organizations have begun to recoil and are re-raising paywalls and shying away from their open attitude on free content. It is clear that many have reached the conclusion that “free” is not a viable long-term strategy for content that costs money to produce for the Web.

The question, though, becomes: what about the rest of us? Whether we’re blogging for fun, to promote a business or to earn ad revenue, we still spend hours writing posts and creating new works. Excerpting can affect us too. Good excerpting can drive tons of traffic and visitors; bad excerpting takes it away. Furthermore, many bloggers, including myself with the “3 Count” series, use excerpting alongside original commentary. We certainly don’t want to harm or detract from those we pull from.

But where is the line drawn and how can anyone tell what side of the fence they are on?

Finding Guidance

As the New York Times article discussed, fair use guidelines in the U.S. are far too ambiguous to be of much use in this area. There are no “hard line” rules with fair use and everything is decided on a case-by-case basis.

While that maximizes the flexibility in the law, it also causes headaches when trying to set up simple rules. Attributor, a company I have consulted with, took a stab at setting some “bright line” rules on their blog, suggesting the following three rules:

  1. Excerpt must contain a link.
  2. Excerpt must use less than 50% of the original content.
  3. Excerpt must also use less than 100 words.

Though they admit that the exact numbers are up for debate, they feel that a combination of percentage and word count can be used to compile a good standard.
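For the technically inclined, the three rules above could be sketched in a few lines of Python. This is only a rough illustration, not Attributor’s actual method: the name `check_excerpt`, the whitespace-based word count and the adjustable thresholds are all assumptions made for this example, and the defaults simply mirror the 50% and 100-word figures quoted above.

```python
from dataclasses import dataclass


@dataclass
class ExcerptCheck:
    """Result of testing one excerpt against the three suggested rules."""
    has_link: bool
    under_share_limit: bool
    under_word_limit: bool

    @property
    def passes(self) -> bool:
        # All three rules must hold at the same time.
        return self.has_link and self.under_share_limit and self.under_word_limit


def check_excerpt(excerpt: str, original: str, has_link: bool,
                  max_share: float = 0.5, max_words: int = 100) -> ExcerptCheck:
    """Hypothetical helper: thresholds are parameters because the exact
    numbers are described as being up for debate."""
    # Assumption: a "word" is any run of non-whitespace characters.
    excerpt_words = len(excerpt.split())
    original_words = len(original.split())
    # An empty original is treated as fully used, so the share rule fails.
    share = excerpt_words / original_words if original_words else 1.0
    return ExcerptCheck(
        has_link=has_link,
        under_share_limit=share < max_share,   # rule 2: less than 50% by default
        under_word_limit=excerpt_words < max_words,  # rule 3: less than 100 words
    )
```

Under these assumptions, a 60-word excerpt of a 400-word post with a link would pass, while a 150-word excerpt of the same post would fail on the word limit even though it stays under half of the original.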

However, Silicon Valley Insider, one of the blogs mentioned in the New York Times article for its heavy use of excerpting, has taken a more “golden rule” approach to the problem and defined their excerpting policy as simply, “We excerpt others the way we hope others will excerpt us.” Though they open the door for excerpting well outside the boundaries of what Attributor laid out in their rules, they did say that they would work with content creators that feel the blog took too much or did not provide clear enough attribution.

What has become clear is that “fair” truly is in the eye of the beholder, something I strongly agree with in the Silicon Valley Insider post, and that what one blogger considers right another will consider wrong. For example, even though I’ve done everything I can to make the “3 Count” column as fair to the original reporters as possible, including large links, limited quoting and adding original commentary, many will likely think it to be bad form.

As true as it is, it doesn’t help bloggers that seek to reuse content while doing the right thing, nor does it help those who’ve had their content used on borderline sites.

Excerpting is one giant gray area and it is getting uglier by the minute.

My Personal Experience

Recently, my article about CC0 took off, receiving mentions on both Slashdot and Ars Technica.

As a result of this, my article has been quoted and excerpted countless times. My FairShare feed has been lit up with quotes from the article over the past few days as it has appeared on many blogs of varying sizes. Nearly all of the uses have involved short quotes, almost always with proper links.

Though I have seen some traffic from those mentions, it certainly has only been a small percentage of visitors to those sites. It has been more of a trickle than a flood from these blogs.

However, this isn’t to say that I think those sites “ripped me off” or did anything wrong. On the contrary, I’m very glad that they liked the article enough to quote it and link to it. I’m very happy with how that article did and caught quite off guard by the attention it received.

On the sites that I visited, it seemed that everyone was doing what was both right and legal. Even if we discarded my CC license and assumed I wanted to stop this kind of behavior, fair use would probably have prevented me from taking any action. Even though I only received a small number of visitors, they are people who never would have found my site otherwise and the fact many visitors didn’t click through is not the fault of those that used the content. The visitors, for whatever reason, were not interested.

To me, proper excerpting is not about bright lines, but about symbiosis, making sure that the creators of the original work gain from your use as much as possible. I don’t think I am ready to accept the Silicon Valley Insider’s “golden rule” approach to the process, since laws and feelings other than your own still have to be considered, but I’m not in favor of bright line rules, such as Attributor’s, either. The latter can be useful as loose guidelines, the type of thing editors would send to their reporters, but not as an absolute rule.

The reason excerpting doesn’t work as well as we would like is that visitors routinely don’t follow through to the source. Even the best use of an excerpt will only pass along a small percentage of visitors to the original story. Though there are exceptions to that rule, such as sites like Digg and Reddit (these are sites where the “easiest” path is to the original site), most of the time visitors don’t click through. It’s that simple.

As surfers, we are all guilty of that: we’ve all read a post with an excerpt and moved on without checking out the source. We probably do it hundreds of times a day without realizing it. First, checking every source is infeasible from a time standpoint; second, sometimes the excerpt is all we need or want.

That’s understandable, and it explains why even proper excerpting can feel like a very unbalanced relationship.

Bottom Line

If you’re a blogger or are otherwise posting content on the Web, you are automatically involved in three different ways: first, as someone who is going to have their work turned into excerpts; second, as someone who is going to use other people’s work, either as quotes or as research; and third, as a visitor reading excerpted content and deciding whether or not to click through.

Though bright line rules can provide some guidance, the only solution that is going to work is respect. Whenever we do any of these things, we have to show respect to the others involved. When using other people’s content, we have to work to attribute correctly and ensure they gain from the use; we have to be respectful of others’ rights to use our content within some limited capacity; and, as we surf the Web and share links, we have to work to ensure we find and reference original sources or at least sites with original commentary.

Are mistakes going to be made? Yes. Are some people going to go too far and need to be called on it? Yes. But if we keep it in our minds that the people who do the research, write the articles and create the content deserve the lion’s share of the reward and let that guide our actions, for the most part we’ll be ok.

Though big content creators like newspapers and magazines are going to have to hammer out business strategies to keep their doors open as they transition to the Web, bloggers and smaller content creators have a different, albeit similar, set of concerns.

In the end, we all have to work together. The difference between the good guys and the spammers is that the latter don’t care if the creator gets any of their due. If they work within the law, it is only because it is easier, not because it is the right thing to do. Good neighbors on the Web consider these issues and, though they might reach conflicting conclusions, at least try to offer support back to creators.

We should focus as much on the true bad guys of the Web. They are the ones doing by far the most harm and, while this conversation is important to have for many reasons, it can’t be a distraction either.

There are plenty of real scumbags on the Web to fight.

21 Responses to Excerpts, Scraping and Fair Use

  1. Great post and thanks for the link. I like your suggested policy – 100 words seems like it might be a pinch too much but certainly better than the policy of some blogs to scrape as much as they want. So much of it comes down to respect.

  2. Welcome for the link, you wrote a great follow up and I wanted to cover it. I agree that the 100-word limit might be high and they do admit that the exact number of words is up for debate, but I also agree the most important thing is to have a limit. It would at least avoid cases, as described in the article, where 400 words were "excerpted". Hopefully we can figure this out and come up with some kind of semi-universal standard that works for everyone. It won't be easy though and it's going to mean having some difficult conversations…

  3. Pet Lizards says:

    Scraping content has been going on for a long time. Well before the WWW was invented. People used to scrape content out of books and newspapers for reports. It's become an issue because in the internet age it is the equivalent of losing money. Lots of sites make their "bones" using content that other people have developed. Slashdot, Facebook, Myspace, Drudge, etc, etc, etc. All of them rely on either content from other sites, or user created content to drive the traffic.

  4. Thanks for the compliment Jonathan. Yea, I am so ready to have the conversations – so far I see absolutely no reason why someone like AI needs to scrape 400 words from a story other than the reasons I listed.

  5. Agreed, 400 words is over the top by any stretch, especially when you aren't adding significant commentary of your own. It's just plain greedy. The first tenet of fair use is to take only what you absolutely need, not what you feel you can.

  6. While this is true, the reason it equals a loss of money on the Web deals more with news cycles than anything. Where once it took days, if not weeks, to excerpt and redistribute news, now it only takes a few minutes. To a visitor, who may only read the news a few times per day, they are bombarded with choices about where to get it, most of them these days not performing much original reporting. Regarding sites such as Slashdot, Digg, etc. One distinction I do make is that, for them, the easiest path is to the original content. Less true on Slashdot, but on Digg the easiest path to the information is to visit the original site, look at the home page and the way the outbound links compare to the "comment" ones. I'm not saying that what UGC or excerpt-driven sites are doing is wrong, just that there needs to be a constant effort to create a symbiosis between those sites and creators.

  7. [...] Excerpts, Scraping and Fair Use (plagiarismtoday.com) [...]

  8. Great article as always! I have a question to raise: It's one thing to quote a source occasionally or when relevant, but how about semi-legitimate scrapers that do this repeatedly and create individual pages that consist of content that is scraped from a single source? And in some instances from what I have seen, they include every single post. So they are essentially creating a "mirror" of the original source.

  9. It's a tough question there are no easy answers to. The reason is that courts have really not taken this one up and it doesn't seem likely that a case is going to reach a court any time in the near future. The consensus seems to be that it is definitely spam and immoral, but due to the small amount of work being used, there is a strong fair use argument (though not necessarily a dead ringer) and that, combined with the limited damage, makes them not much worthy of targeting. My attitude is that there are enough full feed scrapers still to keep most of us busy for some time, going after the partial ones as anything other than a spam blog seems to be a waste to me. Hopefully we can get to a point though where they are the greatest of our worries.

  10. [...] “fair use” and “scraping the web” are terms mostly used when talking about copyright infringements for print media producers. The idea that citizen journalists can now report on news and other [...]

  11. [...] Fav.or.it, a content aggregation service, had earned a great deal of controversy among many bloggers. The site would collect content from various RSS feeds, at least in some cases including the full content, and display it on their site as well as offer visitors the chance to comment and discuss the news, away from the original site. This caused some to accuse Fav.or.it of using splogging as a business model and earned it several mentions on this site, including this one. [...]

  12. [...] of text content published on someone’s private web site and copy it onto your own. You may be allowed to use short excerpts or quotes for commentary and discussion purposes, according to US [...]

  13. John says:

    Thanks, This is just what I was looking for. I want to use a small excerpt and do so fairly, and now I have a decent idea of how to do so.

  14. John says:

    Thanks, This is just what I was looking for. I want to use a small excerpt in a fair way and now I have an idea of how to do so.

  15. [...] via Excerpts, Scraping and Fair Use | PlagiarismToday. [...]

  16. [...] Read this post on PlagiarismToday to learn more about the rules of Fair Use.  You can also read the Wikipedia article on it. [...]

  17. I realize this is an old old article, but I am embroiled in this one now, via the "paper.li" service. Thanks for the article and the insights.

  18. [...] article “Excerpts, Scraping and Fair Use” says the “bottom line” in all this is that there [...]

  19. [...] Quote another blogger Still short on time and ideas? There are countless people out there generating interesting content about your industry. If you see something good, link to it from your blog and copy and paste an excerpt. It’s your right to share and comment on others’ work under Fair Use. This kind of linking builds relationships and link juice. Just don’t overdo it. [...]
