CrowdSourcing Spam Blogging?


For spam bloggers, or sploggers as they are often known, copyright is one of the most daunting challenges. It only takes one or two copyright complaints to bring down a spam blog network by alerting the host, destroying a significant amount of work. Likewise, a few complaints to advertisers can strip a splogger of a large percentage of their income.

Because of this, sploggers have been working on ways to feign legitimacy. Doing so helps them stay online longer, since hosts are more reluctant to take them down, helps them establish a better rapport with the search engines, which is their end goal in most cases, and makes them more appealing to human visitors.

They’ve used many tactics to meet this goal, including truncating the content they use to comply with fair use, spinning content so that it is unrecognizable and even skipping borrowed content altogether in favor of automatic content generation.

However, several readers have drawn my attention to a new kind of spam site, one that, according to the site itself, gets its readers to submit RSS feeds for inclusion and then tries to hide behind a veil of user-generated content. This idea of crowdsourcing spam is a relatively new one to me, one that closely mirrors YouTube’s “wild west” early days, but it is almost certainly going to upset many bloggers who have had their content used without permission.

The Example

Note: All links to the site have been nofollowed. Please visit those links carefully and note that you do so at your own risk. The links are included purely for demonstration purposes.

TheBlogHub is, by all appearances, a very large and prolific spam blog network. It republishes the full RSS feeds from roughly 50,000 sites without truncation and while hotlinking the original source images.

This includes many of the Web’s most popular blogs, including TechCrunch (which appears to be out of date), Mashable and Engadget (also out of date).

However, according to the site, all the RSS feeds are submitted by users of the service. The exact nature of this service is unclear beyond the site’s mission statement of “to provide quick and easy access to relevant blogs and articles for our guests and members, whilst promoting the respective blogs and their authors.”

But the site does very little to actually promote authors. Not only is the full content used, but the site’s robots.txt file encourages search engines to read the content, making it a direct competitor with the original articles. There is also no link back to the individual posts, just a small link back to the home page at the top of a site’s content.
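For readers who want to check how an aggregator's robots.txt treats search engines, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. The two robots.txt samples and the `aggregator.example` URL are hypothetical and written purely for illustration; they are not copied from the site discussed here, and `allows_crawling` is a helper name invented for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
# An empty "Disallow:" line means nothing is blocked.
PERMISSIVE_ROBOTS = """\
User-agent: *
Disallow:
"""

# A version that asks crawlers to stay out of the republished feeds.
RESTRICTIVE_ROBOTS = """\
User-agent: *
Disallow: /feeds/
"""


def allows_crawling(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URL of a republished post on an aggregator.
    url = "http://aggregator.example/feeds/some-blog/post-1"
    print("permissive:", allows_crawling(PERMISSIVE_ROBOTS, url))
    print("restrictive:", allows_crawling(RESTRICTIVE_ROBOTS, url))
```

A site that only wanted to offer readers a convenient view of feeds could use something like the restrictive version to keep republished posts out of search results. A permissive file, by contrast, invites search engines to index the copies alongside the originals.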

The site also accepts comments on its service, which has the potential to further fragment the audience and conversation for the blogs involved.

To make matters worse, though the site offers a means to request removal of content, you are required to provide some form of verification that the content is yours. Adding a feed to the service, by contrast, appears to require no such verification. The site also offers a means to file a DMCA-like notice, buried in its terms of service, but the email address bounces mail as undeliverable.

Hosted on Web24, the site appears to be based out of Australia and has ties to an Australian company whose other sites are advertised heavily on it.

In short, despite the fact that this site proudly proclaims not to be a spam blog network, it at the very least bears all the signs of being one. If its goal is truly to be a legitimate service, it has many steps that it should take to be more cooperative with the original authors.

Note: Attempts to email the creators of the site via both the listed email address and the address listed to receive DMCA complaints were returned as undeliverable.

Why This is a Problem

If these sites are truly crowdsourcing the locating and addition of RSS feeds, which is up for debate, it can create challenges for content creators whose works are being reused without permission.

First, in some cases, the sites may qualify for safe harbor. If the content is actually provided at the direction of users and the sites can show they did not profit directly from the infringement, they may be able to claim it. That defense is heavily muddied, though, by the Grokster ruling, which holds that companies can be held liable for “inducing” copyright infringement. Moreover, all of this applies only to the U.S., and the issue becomes murkier still when other nations become involved.

Second, hosts will be much less likely to take down such sites if they seem legitimate. Instead, they will more likely pass on any infringement notices to the owners of the site, allowing them the chance to remove it and continue on with the other content.

Finally, content creators will be more inclined to treat these sites as legitimate and contact the owners directly, if possible, to resolve these matters. Even if the site is intentionally or tacitly encouraging infringement and benefiting from it, copyright holders will treat them as if they were other legitimate hosts.

The problem with all of this, however, is that it seems unlikely to me that users would willingly crowdsource a spam blog network. Contributing RSS feeds to a service for “centralization” seems unlikely to attract thousands of visitors. Instead, it seems much more likely to me that these sites merely attempt to give the appearance of legitimacy by pretending that the content is submitted by third parties.

Bottom Line

Still, I have no way of knowing with any certainty what is going on in this particular case. But whether they are actually receiving the feeds from users who are agreeing to their terms of service or simply pretending, the result is the same: scraped content from many thousands of sites, the majority of which almost certainly never gave permission.

A spam blog is a spam blog. Whether it is created intentionally, through recklessness or even simple mistake, the outcome is the same.

As such, the spam blogs need to be dealt with accordingly. Though contacting the owner might be best in cases where it seems to be a simple mistake, such as with an RSS reader that was accidentally exposed to the broader Web, in other cases it is most likely best to go with the hosts or advertisers if possible.

I typically encourage people to try to sort out disagreements over copyright directly with the other party. With spammers, however, it is usually a waste of time. As is the case with this site, two letters seeking comment bounced back, including one to the email address supposedly set up to receive notices of copyright infringement.

If your feed is republished on the above site, for example, and you want it removed, you would likely be better off reaching out to their host, especially since all of their contact addresses no longer work.

Facebook Comments

Lace Llanora says:

Thanks for writing this! I've emailed the SPAM BLOG, Bloghub and no response. They have been copying my fashion blog and there is no way I could stop them.

I never registered for their service or to be a part of their website.

Lace of http://styleandrelax.net

Lukman Surani says:

hi, thanks for the write up. Is there any code we can put in to block thebloghub from copying our content?
What I’d suggest to all content creators is let’s create a blog posts titled
BOYCOTT thebloghub, as it a spam blogger. Please go direct to our blogsite at …. . Support original content creation.

el the man

Claudyar6 says:

Ty again Jonathan. I received an answer from the Blogger/Google, and they says they could do nothing, cause it´s not a site at Google system (don´t understand this….). Well, i´m waiting. Tks. xo

As mentioned in the article, I would try contacting their host directly using the information here: http://www.web24.com.au/contact/133/contact_web…..

claudya says:

Sorry for bein boring, my url of Gothicbox is http://gothicbox.blogspot.com/. Ty very much.

claudya says:

Jonathan please, did you talk to Web24??? I write to Google DMCA at Jun/10/2010, but till now, have no anwser from them. I discovered my other blog http://blog-memories2.blogspot.com is there too. I don´t give any permission for this site to post my feeds. And worst, there´s option for comments of posts there!!!!!! I´m thinking on to move on for WP, but i´m lost…. I have the blogs to enjoy, and it´s make me very sad. Any tips of what i could do? Tia and xo from Brazil.

I would report them to their host, Web24, linked in the article, and see if they'll handle it. They might not but it's a good start…

claudya says:

They´re stealling feeds of my posts of Gothicbox!!!!! I´m at Brazil, & don´t know what to do!!!! My blog is at BlogCatalog. I think it´s from there, could be???? only see this today at 06/10/10. Help me guys, please?!!!!!

If you're interested, send me an email and we may be able to find a way to shut these guys down. Just provide me the URLs that you object to.

I'm having a time with these guys too. They're running a feed from my blog, using an old title that I haven't used in three years (don't even know where they got that from). To add insult to injury they're also hotlinking to images on my server. I've sent them three notices via their contact form to no avail.

techandlife says:

Interestingly I am registered with Blog Catalog like Kate, but I notice my latest post there is from Feb 20th. TheBlogHub scraped 10 posts from my home page the day after my latest post on March 25th and didn't wait for Blog Catalog to update. Perhaps TheBlogHub just gets the initial blog details from Blog Catalog and adds that to their own crawling list.

If it were an issue of TechCrunch clout, I'd wager the 31 articles would be gone. No, I think Kate is on to something, that they are pulling from Blog Catalog and the feed they had in that case went dead. I'm going to check this out further!

That's an interesting point. I'll have to look deeper into that. I wonder if this might also explain why the feeds are updated on a hit or miss basis. Could be interesting.

The problem is that no system is going to be 100% perfect. Yes, Google can and needs to do better, especially with point 1, but spammers game the system not by exploiting flaws but by playing to exceptions. If you shoot out enough sites, a few are going to rank well. I don't know how much defense there can be for that, but I still agree with the crux of your argument, that Google needs to do more.

techandlife says:

Had a look through some of the 3261 (!!!) tech blogs in TheBlogHub's directory listing and noticed that TechCrunch is listed. TheBlogHub scraped 31 of their articles up to June 2009 but interestingly nothing since then. I wonder if TechCrunch had sufficient clout to stop the plagiarism but weren't able to have their content removed or bring the blog down. Wonder if they tried.

Kate says:

I had a look at the Blog Hub site and discovered they have my blog listed there – I didn't add it – but no content. The interesting thing is the description they're using, it's not the one used on my blog, it's taken from Blogcatalog. BC is the only place I've listed my blog using that exact phrase. As far as I know BC are reputable, so I'm guessing these sploggy people are either copying their directory or pulling in the feeds from it.

RN says:

This would all go away if Google would stop rewarding people for stealing content. No one just goes to a spam blog out of the blue. They find their way to spam blogs via the search engines. Spam blogs are set up specifically to rank well in search engines, particularly Google. They also usually come wrapped in Google Ads, too. It is no secret that Google is the worst at identifying and promoting the original source of an article. If two things happened, infringement would be much less of an issue and a lot of the problems would just go away because they would not be financially lucrative. 1) The Google Ad department should actively seek and eliminate all sites that do not produce ORIGINAL content from its network. 2) Google should fix its method of ranking sites in its search results so that the original source ALWAYS ranks above all of the copies. If Google is going to filter duplicates, then filter the duplicates, not the originals!