Reporters and bloggers alike often ask me exactly how bad scraping is on the Web. I usually point to my past experiments on the topic and how, depending on your keywords, suspicious traffic can start showing up with the very first post.
However, as I was searching for information on IE7 security flaws for another site I’m working on, I ran across something that was truly mind-blowing.
On Google Blogsearch, this result (nofollowed) was one of the first to pop up. One look at it and you can clearly tell that it is a scrape of another post. However, kindly enough, the scraper left information about their source. I followed through on that and was taken to this entry (nofollowed), yet another scraped page.
It was only after following the results link there that I was taken to the original post on the IEBlog.
It is stunning, though not surprising, to think that scraping is so common that scrapers are picking up each other’s blogs. What makes this situation unusual is that I was able to follow the trail, since both scraper sites link to their source. However, it shows the potential for a post to get scraped again and again as its copies get picked up by other spambots.
In short, every feed your work appears on can, and most likely will, be scraped, even if the appearance is unwanted. It may even be possible to piece together much longer chains of scraping, where you end up with a fifth or sixth generation scrape.
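To make the idea of scrape "generations" concrete, here is a minimal Python sketch of following a trail of source credits back toward the original. It assumes you have already noted, by hand, which page each copy credits. The `trace_chain` function and the `example` URLs are hypothetical and only illustrate the bookkeeping; nothing here fetches or parses real pages.

```python
from typing import Dict, List, Optional


def trace_chain(start: str, source_of: Dict[str, Optional[str]]) -> List[str]:
    """Follow credited-source links from `start` until a page credits no one.

    `source_of` maps a page to the page it credits as its source, or to None
    when it credits nothing (hypothetical, hand-collected data).
    """
    chain = [start]
    seen = {start}
    current = start
    while True:
        nxt = source_of.get(current)
        # Stop at a page with no credited source, or if the credits loop back.
        if nxt is None or nxt in seen:
            break
        chain.append(nxt)
        seen.add(nxt)
        current = nxt
    return chain


# Hypothetical pages standing in for the two scrapes and the original post.
credited = {
    "scraper-a.example/post": "scraper-b.example/post",
    "scraper-b.example/post": "original.example/post",
    "original.example/post": None,
}

chain = trace_chain("scraper-a.example/post", credited)
generation = len(chain) - 1  # number of hops between the first copy and the end of the trail
```

With the mapping above, the trail runs through both scraper pages before reaching the original, which makes the first result a second-generation scrape. The trail only goes as far as the credits do, so a copy that names no source ends the chain early.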
In this case, the first feed was most likely a scrape of a search engine feed such as Google Blogsearch or Technorati. The second one is a news site that, it appears, is reposting and redistributing the entire content of feeds in certain places, though stripping formatting in the process.
This gives us yet another reason to get a handle on our RSS feeds and make sure that they don’t fall into the wrong hands to begin with. Though these sites attributed their use, most are not so generous, and even attributed scraping can cause problems.
All in all, it is best to be mindful of this problem and respond accordingly.