Cpedia: A Spam Blog Disguised as an Encyclopedia

Jonathan Bailey, June 9, 2010

Last week, @melebeth introduced me to CPedia, a new “encyclopedia” by the makers of Cuil, a search engine that was initially greeted with much fanfare before seemingly flaming out.

Cpedia is not an encyclopedia in the strictest sense as it is not written by human beings. Unlike traditional encyclopedias, which are written by paid experts, or Wikipedia, which is written largely by volunteers, Cpedia is generated automatically from search result pages, producing an encyclopedia-like page that is often wildly inaccurate.

For example, I have a Cpedia page, and there is one for this site as well, though my personal page doesn’t actually say anything about me and the one for PT seems to discuss random people and items only tangentially related to the site.

However, the concern Melebeth approached me with was not just about the accuracy of Cpedia, but about the way it used content from other sources. According to her, the search engine was lifting text directly from third-party sites but not properly quoting or citing it.

So, I delved into Cpedia and found, unfortunately, that her fears were largely founded.

How CPedia Works

The basic idea behind CPedia is that it combs through the search results for a relevant term and tries to build out an encyclopedia entry automatically. The results, visually, are very similar to Wikipedia but the content is generally more jumbled and difficult to read.

Cpedia does attribute the content it uses, but in a very strange way. If you click or hover your mouse over the text within an article, but not the link, a sidebar appears showing the text that’s been quoted along with a link to the source. If you click the inline text link, you are instead taken to a references page that then links to the original source.

Usually, the individual copied passages are very short, though sometimes the word count of a passage approached 100 words, especially when the source passage was broken up into multiple parts.

CPedia seems to have a nearly unlimited number of topics covered, likely aided by the fact it is automatically generating results, and has many pages that Wikipedia does not, including one for me.

All in all, CPedia is fairly straightforward but that does not mean it isn’t a problem. In fact, in this case, it means quite the opposite.

Problems With Cpedia

Apart from the questionable accuracy of Cpedia, the entire operation, to me, seems highly suspect. The idea of creating new pages of content using snippets from dozens, even hundreds of other pages seems to be a very poor way to do business.

But even setting aside the way the content is created, there are several problems with the attribution alone. Consider the following two:

Always One Step From Source Link: Whether you hover over the text or click to the references page, you are always one action away from the source link. This means users and other search engines alike are always two steps from the source site even though it would be trivial to make it one.
Lack of Clear Quotes: The entire entry is made up of short verbatim quotes from various sources but it is not clear where the quotes begin and end without hovering over the text. The goal is to make the entire work seem like an original creation, an actual encyclopedia entry, without much in the way of visible quotes, just traditional footnote citations.

However, the bigger problem is actually very simple. There are already many sites that build thousands and thousands of entries using snippets from various other pages. They’re called spam blogs and they use a variety of article generation and spinning technology to build new articles out of hodgepodges of existing ones.

And Cpedia is acting very much like a spam blog. Entries from Cpedia are appearing in Google, which currently has about 177,000 entries indexed, and though Cpedia’s robots.txt disallows the wiki directory, that doesn’t seem to be stopping search engines from indexing the entries.

When you factor all of this together, it becomes clear that Cpedia is acting more like a spam blog and less like an encyclopedia. Was that the intention? Probably not. But it is how the site is functioning, pumping thousands of pages of poorly-written duplicate content into the major search engines.

If that is not the hallmark of a spam blog, I’m not sure what is.

Making it Stop

To be clear, what Cpedia is doing isn’t, most likely, illegal. Fair use would likely protect their very limited use of the content from each individual source. One of the reasons this technique is so common among spam blogs is that it makes them almost immune to copyright disputes as a means of closure.

In short, even though the ethics of Cpedia can be hotly debated, most likely they are on the right side of the law.

That being said, if you want your work removed from Cpedia, all you have to do is remove it from Cuil and that can be done by using robots.txt to block “twiceler”.

Also, you can block the IP range that Cuil uses for crawling, which is also listed on the link above.
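
As a rough illustration of the robots.txt approach, here is a minimal sketch in Python that checks whether a given robots.txt would turn away a crawler identifying itself as “twiceler” while still letting other crawlers in. The rules shown are an assumption about what such a block might look like, and the example.com URL and the `is_allowed` helper are hypothetical, used only for illustration. It relies on Python’s standard `urllib.robotparser` module and does not contact any server.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents: shut out "twiceler" entirely,
# leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: twiceler
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/some-post/"  # hypothetical URL
    print("twiceler allowed: ", is_allowed(ROBOTS_TXT, "twiceler", page))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", page))
```

Keep in mind that robots.txt is only a request. It works when the crawler chooses to honor it, which is why blocking the crawler’s IP range is the firmer of the two options.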

It is a fairly simple change to make. (Note: I have not made it on PT and will not; I keep my robots.txt open intentionally to help observe various issues like these.)

All in all, though I disagree strongly with what Cpedia is doing, they do have the right to do it. This makes fighting back trickier, but far from impossible.

Bottom Line

What Cpedia is doing, in my opinion, is unethical. They are using quotes from various sites without adequate clarity or attribution. They are pumping thousands of pages of admittedly duplicate content into other search engines and are producing an encyclopedia that, by their own admission, is wildly inaccurate.

Though copyright may not be a viable litigation route, I have to wonder how libel will work in this case as repeating libel is, generally, the same as making the libelous statement. In short, those admitted inaccuracies in Cpedia could, in theory, come back to bite the company at a later date.

Search engine liability in cases of libel is still being settled around the world. Google won such a claim in the UK, but republishing this information on your own site and admitting it is inaccurate seems to open up new avenues for liability.

Would this be a likely claim against Cuil/Cpedia? Probably not. But only because the audience for the site is so small that it seems unlikely many will care. The fact that Cuil/Cpedia has seen so little success is a big part of why webmasters haven’t noticed the spammy nature of the issue and taken up arms.

To be certain, Cpedia flew under my radar until Melebeth asked me about it. I can imagine it is doing the same for many others right now as well.
