What do the following sites have in common: Flickr, Technorati, Icerocket, Google Blog Search and Youtube?
Answer: They all provide feeds to distribute content they do not own.
While this has done a great deal to improve the usability of these sites and has helped simplify thousands of people’s lives, it has also created a headache for many content creators.
Since these forms of distribution are outside the control of the content creator, they have become popular targets for scrapers and spammers who love the keyword-rich content these APIs and feeds provide.
Powerless to stop them, bloggers and Webmasters can only sit and watch as their words are plagiarized, used to peddle junk or generate artificial search engine credibility.
If you run your own Web server or use a paid host, you have a great deal of control over your site. If you find that someone is scraping your feed, you can take steps to stop them.
However, once that content leaves your site, it is completely out of your control. If a search engine decides to offer an RSS feed or a third-party site offers an API, those who want to steal your content rather than read or use it are free to do so with relative impunity.
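On your own server, at least, a scraper that identifies itself in your access logs can be shut out with a few lines of Apache configuration. A minimal sketch, assuming a hypothetical user-agent string (substitute whatever actually appears in your logs):

```apache
# .htaccess — refuse requests from a known feed scraper
# "EvilScraperBot" is a placeholder, not a real bot name
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} EvilScraperBot [NC]
RewriteRule .* - [F,L]
```

This returns a 403 Forbidden to the matched agent. It only helps with direct scraping, of course; it does nothing once your content is being redistributed through a third party's feed or API.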
Google, Technorati and Icerocket have all shown relatively little interest in putting a stop to this kind of scraping. Though they do a decent job of keeping the results out of their own search engines, they seem fairly happy to provide a steady stream of keyword-rich content to spammers and anyone else who might pick it up.
This, in turn, has put copyright holders in a strange bind, forced to make a decision between protecting their work and the services they depend on to get their work out there.
An Easy Choice
Consider the choice most Webmasters face. They must either:
- Prevent all search engines, especially blog search engines, from accessing their feeds and submitting them to their APIs, or
- Deal with some amount of summary copyright infringement
The choice is painfully simple. Outside of the most die-hard copyright extremists, who generally dislike search engines in the first place, most will choose to deal with any infringement that comes up, handling cases as needed down the road.
Still, there has to be an easier way. Outside of copyright and plagiarism issues these sites create, they also generate large amounts of spam, hurt legitimate users of services such as Blogger and generally pollute the Web.
These sites are taking our content and distributing it to others in an easy-to-scrape format. They have a responsibility to at least try to prevent the bad guys from taking advantage of it, even if they don’t seem to think so.
(Somewhat) Ideal Solutions
Generally speaking, everyone loves these APIs. I use several Technorati feeds to help me keep track of copyright news in the blogging world and would certainly be hard pressed without them. Those feeds have also helped me discover new blogs, many of which I subscribe to now, and I know of at least a few that have found this site through the same means.
The challenge becomes finding a way to protect these feeds from abuse without affecting their functionality. It sounds like a difficult challenge, but it isn’t impossible.
Here are some possible solutions that at least some of these sites could use:
- Easter Eggs – As discussed earlier, copyright Easter eggs can be a great way to prove ownership of a work when many similar ones might exist. In this case, however, they can also be used to prove scraping. All a search engine has to do is insert random fake entries into its topic feeds. The post URLs would lead to 404 pages, since the original entries never existed (404 errors are surprisingly common in these feeds, so few real users would even notice). Any site that displays those Easter eggs is a scraper and, once the ping for the bogus entry is received, the search engine can block the site from accessing its feeds again.
- Identifying Yourself – If you see a site that is clearly scraping from a search engine or another third-party API, it can be tricky to tell which one. Is that a Technorati feed or an Icerocket one? There’s no easy answer. These feeds need to identify themselves and let the public know where the content comes from. That way, abuse can be reported to the appropriate site.
- Headline Only – The small snippets used by most search engine feeds are completely useless to the average reader. Out of fear of violating copyright, they often use only a few dozen words of the post, and it’s a random portion surrounding the keyword in question. Either making the feed headline-only or using the actual summary of the post, which might not contain the keyword at all but gives a better idea of the article’s topic, would thwart scrapers entirely while still keeping the feed useful for the rest of us.
- Unique Feed – Finally, rather than having every user use the same feed, have every user create their own. The feeds would be identical but could be earmarked by Easter eggs or some other tag. This would make it easy both to report the misuse and shut it down.
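The Easter-egg and unique-feed ideas combine naturally: if the bogus entry's URL encodes the subscriber who requested the feed, any scraped copy can be traced to a specific API key. Here is a minimal sketch in Python; the URL scheme, secret, and key names are hypothetical, not any search engine's actual implementation:

```python
import hashlib
import xml.etree.ElementTree as ET

def make_easter_egg_item(api_key: str, secret: str = "server-side-secret") -> ET.Element:
    """Build a bogus RSS <item> whose URL encodes the subscriber's API key.

    The token is a keyed digest, so a republished copy of the feed can be
    traced back to the key that requested it. (Hypothetical scheme.)
    """
    token = hashlib.sha256(f"{secret}:{api_key}".encode()).hexdigest()[:12]
    item = ET.Element("item")
    ET.SubElement(item, "title").text = "Weekly roundup"  # innocuous fake title
    # This post never existed: the URL will 404, so real readers simply skip
    # it, but any site republishing it has provably scraped this feed.
    ET.SubElement(item, "link").text = f"https://example.com/posts/{token}"
    ET.SubElement(item, "guid").text = token
    return item

item = make_easter_egg_item("subscriber-42")
print(ET.tostring(item, encoding="unicode"))
```

When a ping arrives for one of these tokens, the operator can look up which key it was generated for and revoke that key's feed access.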
Even though all of these steps are fairly simple and could easily be implemented without much cost or time, they likely won’t be. These search engines and services have shown a general disinterest in protecting the content they have been entrusted with.
This, in turn, has left bloggers and Webmasters to fend for themselves.
The problem of third-party scraping isn’t as serious for most bloggers as traditional scraping, due to the limited amount of content stolen. However, that doesn’t mean bloggers can’t or shouldn’t take steps to reduce this kind of content theft. If nothing else, helping to stop search engine spam will make the Internet a better place.
Besides, in some cases these third-party feeds can cause a great deal of problems for copyright holders, especially when visual works are involved.
So, until such a time as search engines patrol their own services, here are some tips to reduce content theft via third-party feeds.
- Watermark All Images – It’s pretty simple. Place a watermark on all images you submit to Flickr and add a credit slide to all videos you place on Youtube or Google Video. That way, even if they are scraped via a feed, any viewer can easily locate the original source.
- Digital Fingerprint Plugin – The Maxpower Digital Fingerprint Plugin, which has become a rapid favorite of mine, may not be entirely effective here, since the fingerprint would have to sit close to the keyword. However, it will help with this type of scraping more than FeedBurner can. Sadly, this is simply out of FeedBurner’s control, though some of their Feed Flare extensions might also be able to help.
- Give Yourself a Byline – Though these feeds usually present the portion of the post around the desired keyword, they still favor the first paragraph of the article, especially on shorter works. It might be worth your time to give yourself a quick byline at the top, so at least some of the scraping sites will present attribution.
- Report Scrapers – As I discussed in my previous article, reporting spam sites to their advertisers and their hosts is probably the best way to handle such matters. You can also report the site to the originator of the feed, if it can be discovered; they might take an interest when directly confronted. Still, hitting scrapers in the wallet hurts the most.
- Choose Services Carefully – Some search engines are better about this than others. Though most of the big names are a worthwhile trade, others seem only to serve the spammers. Use cloaking and robots.txt to prevent unwanted sites from using your content. If they are generating APIs with your content and are being scraped without offering anything in return, it might be time to opt out.
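The opt-out in that last step is mostly a robots.txt matter. A minimal sketch — the crawler names here are hypothetical placeholders for whichever bots you actually want to exclude:

```
# robots.txt — block one crawler entirely
User-agent: UnwantedSearchBot
Disallow: /

# Keep all other crawlers away from a spam-prone category feed
# (path is an example; use your own feed URLs)
User-agent: *
Disallow: /category/mortgages/feed/
```

Note that robots.txt only works against crawlers that honor it; a bot that scrapes your feed directly will ignore the file, which is where the .htaccess-style blocking mentioned earlier comes in.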
These aren’t satisfying steps but they are the best ones that Webmasters can reasonably take. No reasonable person is going to advocate shooting yourself in the foot over relatively minor instances of content theft, but it still may be worth taking some extra precautions, especially if you frequently write about topics that are considered spam-friendly (mortgages, financial information, prescription drugs, etc.).
For most bloggers, this will be a rare and relatively minor problem. The keywords most spammers target are pretty narrow in scope. However, bloggers that fall into those fields regularly probably have reason to worry.
Also, photo bloggers, especially those that use Flickr, and video bloggers, especially those on sites like Youtube, have reasons to worry as well. Though the search engine benefit of such content is slight, save perhaps on Google Image Search, the risk of content being separated from attribution is much higher.
None of this is worth obsessing over, but it is worth thinking about and guarding against. If simple steps can protect content without impacting usability, there is no harm in doing so.
It’s just a matter of knowing when to draw the line.