Are Free Blogs More Likely to Be Scraped?

By Jonathan Bailey • May 28th, 2008 • Category: Articles, Prevention

It has been known for some time that many factors affect the amount of scraping your blog sees. For example, the more popular your blog gets, the more spam-friendly your keywords are, and the more sites you ping out to, the more likely your site is to be scraped.

However, one factor that I had not actively considered, but that has come up in recent conversations, is whether the service you use could draw more attention to your site.

Specifically at issue is whether using a free blog host, such as WordPress.com, BlogSpot or LiveJournal, could draw the attention of spammers who specifically target those services. I have seen a handful of spam blog networks that seem to target one blog host specifically, but these occurrences have seemingly been rare.

I would like to get feedback on this issue to see if it is something I should look into more deeply. I would especially love to hear from those who run both self-hosted and freely hosted blogs about their experiences with scraping on each.

It would seem logical to me that freely hosted blogs would be greater targets for scraping, given that their authors are known to have fewer tools for stopping RSS abuse and lack the server control to block scrapers. However, freely hosted blogs would also seem to be less reliable sources of content, considering there is no penalty for bloggers walking away.

If this seems to be a factor, I’ll plan on looking into it more deeply, seeing which services are most at risk and what can be done about it.

I look forward to your thoughts.


Jonathan Bailey is The Webmaster and author of Plagiarism Today, which he founded in 2005 as a way to help Webmasters going through content theft problems get accurate information and stay up to date on the rapidly-changing field. He is also a consultant to Webmasters and companies to help them devise practical content protection strategies and develop good copyright policies.

  • Now this is a very interesting site, Jonathan! I recently launched a blog (25th January 2009). Everything went well until last weekend. All of a sudden I ended up with 118 nonsensical (well, at least as far as I am concerned) posts and am constantly reminded that I have posts waiting to be moderated. Needless to say, blogging at that specific blog-site has become anything but a pleasure.

    So, this afternoon I am googling for blogsites and visiting them to see if any specific site will cater for my needs. Well, lo and behold, I happened on yours without knowing that it is addressing the very problem I'm confronted with presently. (BTW it is "synchronistic" incidents like this that make me believe more firmly in a benevolent universe :o)

    Okay, I surmised that you will not be able to solve my problem, but it is good to see there is someone taking a keen interest in this plague.

    All the best.
  • If there is anything that I can do, please feel free to email me with the specifics of the case. I might be able to have a look and help resolve the matter.

    Best of luck!
  • @zania -
    Glad that it was able to help!
  • Thanks Jonathan
    I'm just updating my ping list now and taking out pingomatic.
    Then I'm off to feedburner to check on who I chose to ping there.
    As I have many many blogs (not scraper splogs incidentally) it'll take a while!
  • Zania: Try this link to an article I wrote on the Blog Herald: http://www.blogherald.com/2007/09/17/how-to-avo...
  • Thanks for the info Jonathan.
    I'm just going through your posts now to find out more about targeting search engines individually.
    Thanks!
  • Zania: The keyword aspect is something I've talked about and studied before. It definitely is one of the biggest factors, but you can mitigate keyword-based scraping by not pinging your site out to broad services and, instead, targeting the search engines individually. Some have found that to be a huge help.

    I would imagine, though, that Typepad would have much the same problem as WordPress or BlogSpot in this area. Though it is paid, it is still a large collection of blogs hosted on one central service. It might have some protection as it is not as large as some hosts, but the theory is still the same.

    Thank you very much for your input!
  • I have free blogs on Blogger and WordPress.com, paid blogs on Typepad, and my own hosted blogs on WordPress, and most of them get scraped regularly, with the exception of the Typepad blogs (but one of those gets the posts plagiarised by being re-written somewhere else instead...).
    Personally, it has made little difference where the blog is hosted, but it is all down to the keywords used. My blogs with 'popular' keywords get scraped so often I have given up caring about it. I just use RSS Footer and hope that the scraper doesn't bother to eliminate the link back to the post I put above each blog entry.
    I write adult as well as mainstream blogs and the adult ones are scraped the worst. What makes me laugh is that one blog is for adult webmasters and is not 'adult' in content at all, but because I use an 'adult keyword' in the title, it regularly gets scraped for adult splogs. I hope the readers of the splogs are well hacked off with the scraper by the results!
  • Cybele: OK, I confess, I had never heard of Expression Engine. Looking at the site though, it appears pretty cool. Might be something to play with later.

    I don't really know how much of an effect that has though. Since all CMSs create the same RSS feeds, it seems to me that scrapers just grab the content from there and move on. There is no need to guess URLs, as the scrapers just pull those from the feed as well. I guess the question though is whether the scrapers favor free blogging services over individual sites due to the convenience of many feeds in one place.

    That being said, having a fringe platform may help some with human plagiarists and hand scrapers. After all, they can't go through item by item and copy/paste as easily.

    Joan: I agree, and that is part of what makes gathering such statistics nearly impossible. That's one of the big reasons I want to hear from authors who have a toe in both worlds. I help mostly free bloggers, but that makes sense because, as you pointed out, there are more of them.

    MojoMan: I agree that the size and prominence of a site are major factors in how much scraping takes place, but they certainly aren't the only ones. The content of the site is a big deal: sites that talk about video poker and loans are more likely to be scraped than those that discuss gardening. Pinging broad services can make a difference, and other things can have a bearing too.

    What I'm trying to see is whether the host you choose is a factor as well. It may not be as important as the size of the site, and I doubt that it is, but I know from talking with people in the business that the large hosting sites are regular targets of mass scraping efforts, largely due to all of the content being in one place.

    It may not be the biggest factor, but it might be one that has to be weighed.

    Oh, and no worries about picking on Google here, you can't say anything worse than what I have in the past...
  • MojoMan
    I don't think it has anything to do with whether or not a blog is hosted on a free service. It has more to do with whether or not the blog has quality content and most importantly, how visible it is. Scraping hits the biggest sites the most, the smallest sites the least. That's why mid-range sites get hit the hardest by Google's duplicate content filter. They are big enough to be noticed and scraped but too small to outrank the spam networks, which are designed to beat Google's link-based algorithm.

    It would be nice if Google could find a way of protecting mid-range sites, but I am beginning to think that they don't know how to do that within the boundaries that they have set for themselves. The other search engines don't seem to be quite as affected because they rely less on incoming links and maybe a few other key facets that are unique to Google's ranking formula.

    I hate to pick on Google, but given how much of the search market they have, it's where everyone is focused on ranking. You hear about made-for-AdSense scraper sites, not made-for-Yahoo scraper sites.

    If anything, these free site hosts are where a lot of the scraping is hosted, not the other way around.
  • Is it possible that sheer numbers play a part in being selected for scraping? I mean there are probably many more free blogs than otherwise.
  • I think that some platforms are more vulnerable than others to scraping. For instance, the default in WordPress is URLs with ID numbers, not generated by the title (or at least it was, it may have changed). With the numbers, the scrapers just needed a starting point and could predict what the URLs for all the posts on a blog would be and just fetch them (whether the blog published an RSS feed or not).

    I'm guessing the same is true for platforms like BlogSpot, where feed scrapers need only guess at every Atom feed address, whether the blogger actually puts it on their blog or not.

    One of the things I like about being on Expression Engine is that it's a more fringy publishing platform, and I think not as big of a target for spam/splog coders.