Is Technorati a Scraper?

Jonathan Bailey, June 19, 2006

Sploggers, when confronted with allegations of copyright infringement and spamming, often turn to the search engines to excuse their actions. Some think that their theft and spamming is justified as part of a larger war against a perceived evil power, the search engines themselves, while others feel that their actions are no worse than those undertaken by Google, Yahoo and others.

While most find that logic questionable at best, one post along those lines made a very interesting allegation: That Icerocket, the prominent blog search engine owned by Mark Cuban, was using the "rel=nofollow" tag on the links in its result pages, depriving bloggers of any additional search engine benefit.

A quick check of Icerocket's HTML code showed that it did indeed "nofollow" all outbound links in its results. However, further research showed similar behavior in both Technorati and, to a lesser degree, Sphere.

Needless to say, this raises some interesting questions regarding the major blog search engines and how they reuse our content.

A Brief Background

To understand the controversy we have to first look at how traditional search engines, like Google and Yahoo, work and generate results. More importantly, we have to understand the importance of links in the equation.

Search engines, for the purpose of this article, use links in two different ways. When a search engine spider visits, or crawls, a page, it uses the links on that page both to discover new content and to help it rank pages. Generally speaking, the more sites you have linking to you and the higher the quality of those sites, the faster you get indexed and the higher you are ranked.

This can be very important, especially to a new site. Getting picked up quickly and ranked high for a popular term in the search engines can bring a windfall of traffic and, possibly, money. Some unethical search engine gurus spend countless hours thinking of ways to cheat the system and obtain links and rankings they wouldn't otherwise be able to earn.

Many sites, including this one, often link to pages that they do not wish to support by increasing their ranking in search engines. Thus, they use a special tag called "nofollow" in the link itself that instructs a search engine not to count that link toward the linked page's ranking, though most search engines will still visit and index the site. Another method involves including a special "Robots" meta tag in the page with a "nofollow" value. This instructs search engines not only to not count the page's links, but also to not visit and index them.

While this overview is simplistic at best, it does explain why links have become the currency of the Web and the favorite form of attribution. It also explains how blog search engines, when dealing with their results, may not be giving as much as they get.

The Case Against the Search Engines

When blog search engines spit out a results page, it is generally filled with keyword-rich content that is quite literally lifted from the sites it indexes. A quick glance at any major blog search engine shows that every results page contains links, content republished from other sources and almost nothing original from the search engine itself.

While this isn't necessarily bad (most have no problem with search engines using their content, as it is very good for traffic and readership), questions arise when one looks at the source code of the various search engine results pages.

A quick look at the source code for an Icerocket results page shows that all external links carry a "rel=nofollow" tag in the link itself. All internal links are left untouched. This means that a search engine visiting the page would index the entire page, including all of the content pulled from other blogs. It would then index and count all internal Icerocket links and index, but not count, any external links. As a result, the Webmasters who provide the bulk of the content on the page receive nearly no benefit while Icerocket itself, potentially, gains a great deal.
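For readers who want to repeat this kind of check on a page of their own choosing, the inspection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name LinkAudit, the host search.example.com and the sample markup are all hypothetical and are not taken from Icerocket's actual pages. It relies only on the standard library's html.parser and urllib.parse modules.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkAudit(HTMLParser):
    """Collect every <a href> on a page and note whether it is nofollowed."""

    def __init__(self, site_host):
        super().__init__()
        self.site_host = site_host  # host of the page being audited
        self.links = []  # tuples of (href, is_external, is_nofollow)

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        host = urlparse(href).netloc
        # Relative links have no host, so they are treated as internal.
        is_external = bool(host) and host != self.site_host
        # rel may hold several space-separated tokens, e.g. "external nofollow".
        rel_tokens = (attr.get("rel") or "").lower().split()
        self.links.append((href, is_external, "nofollow" in rel_tokens))


# Hypothetical markup for illustration, not copied from any real results page.
SAMPLE = """
<a href="/search?q=copyright">More results</a>
<a href="http://blog.example.org/post-1" rel="nofollow">A blog post</a>
<a href="http://other.example.net/entry" rel="external nofollow">Another post</a>
"""

audit = LinkAudit("search.example.com")
audit.feed(SAMPLE)
for href, is_external, is_nofollow in audit.links:
    kind = "external" if is_external else "internal"
    print(f"{kind:8} nofollow={is_nofollow}  {href}")
```

On the sample markup, the internal link is reported without nofollow while both external links are reported with it, which is the pattern described above.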

A look at a regular Technorati result reveals something similar, yet very different. At the very top of the code is a very simple line that reads as follows:

<meta name="robots" content="index,nofollow" />

This tells a search engine to go ahead and index the page, which once again contains almost exclusively content from other sites, but to not follow any links on the page, either for indexing or counting.

This means that for every Technorati results page that gets indexed in the search engines, and there are currently nearly 2 million, Technorati gets the keyword-rich content provided by Webmasters while the Webmasters' own pages are neither indexed nor counted through those links. Of course, Technorati hurts itself too, since none of the links on the page are followed, including its own local ones.

Finally, looking at the code for a standard results page for upstart Sphere.com exposes a similar tag to Technorati's but one with a notably different outcome. Their tag reads as follows:

<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">

While the tag is the same in that it tells search engines not to count or index any links on the page, it also tells the search engine not to index the page itself. This means that, should a search engine stumble across the page, it will simply ignore it, neither indexing the content nor following the links.
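The difference between the two tags can be made concrete with a short Python sketch. It is a simplified model of how a crawler might read a "Robots" meta tag, not a description of how any particular search engine is implemented. The names RobotsMeta and robots_policy are hypothetical, and the two inputs are the tags quoted above.

```python
from html.parser import HTMLParser


class RobotsMeta(HTMLParser):
    """Find the content of the first <meta name="robots"> tag, if any."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # The name is matched case-insensitively, so "ROBOTS" also counts.
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            if self.content is None:
                self.content = attr.get("content") or ""


def robots_policy(html):
    """Return (may_index, may_follow); both default to True with no tag."""
    parser = RobotsMeta()
    parser.feed(html)
    tokens = {t.strip().lower() for t in (parser.content or "").split(",")}
    # "none" is the conventional shorthand for "noindex,nofollow".
    may_index = not ({"noindex", "none"} & tokens)
    may_follow = not ({"nofollow", "none"} & tokens)
    return may_index, may_follow


technorati = '<meta name="robots" content="index,nofollow" />'
sphere = '<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">'

print(robots_policy(technorati))              # (True, False)
print(robots_policy(sphere))                  # (False, False)
print(robots_policy("<p>no robots tag</p>"))  # (True, True)
```

Under this model, the Technorati-style tag leaves the page indexable but blocks its links, while the Sphere-style tag blocks both.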

While Sphere certainly has the fairest solution, one has to wonder why every major blogging search engine has gone to such great lengths to keep other search engines from counting links in their results.

After all, a quick glance at a regular result for either Google or Yahoo shows that neither site has any nofollow tags in their HTML. What makes the blog search engines so different?

A Question of Why

There are many possible reasons why blog search engines would want their results to not be counted in other search engines. Some theorize that search engines penalize sites with large collections of outgoing links, some might say it's a sign of hostility between the blog search engines and the traditional ones, and a few go as far as to say that it's proof the blog search engines are engaging in scraper-like activity.

The truth is that I don't know. I emailed all three of the search engines on the fifteenth and, as of this writing, have not heard back from them. Without any word from them, all I can do is theorize.

However, on that note, I do think I have an idea as to what would make such major sites take such drastic steps.

A Natural Comparison

At the end of the day, the nofollow mystery comes back to the same scrapers and spammers that hurl the accusations against them. In short, without them, it's highly unlikely that Technorati and the other blog search engines would take these steps at all.

An easy comparison deals with a form of spam most bloggers are familiar with, comment spam. Most major blogs and blogging applications automatically add the nofollow tag to any links posted in comments. The reason for this is to prohibit spammers from benefiting from their activities and, hopefully, to discourage the behavior altogether.

While the effectiveness of nofollowing as a deterrent for comment spam is debatable, it still offers some reassurance that comment spammers will not be able to get undeserved "votes" from the sites that they spam. However, it also means that legitimate commenters gain no benefit either.

Still, even though the vast majority of comments actually up on a maintained site at any given time are legitimate, the practice is both commonplace and accepted. No commenter, spammer or otherwise, expects to have a link posted without the nofollow tag.

With the abundance of splogs, blog and ping operations and other junk content being produced by various automated means, it makes sense that the blog search engines, the front lines in the war against splogging, would want to discourage the behavior. The natural step, sadly, was to nofollow all of the links.

It's important to remember that sploggers aren't targeting the blog search engines. They use the ping services and blog search engines, theoretically, as a springboard to get into and get ranked high in the major search engines. By reducing that benefit for sploggers, the blog search engines, theoretically, make the blogging community a slightly better place and, hopefully, keep their own results a little cleaner as fewer spammers target them.

How this will work out remains to be seen, but it at least provides a logical explanation for seemingly irrational, and even greedy, behavior.

Conclusions

In the end, my main problem with the blog search engines isn't that they nofollow external links but that they are, with the exception of Sphere, willing to let other search engines index the results pages themselves. While Icerocket has relatively few indexed results pages, well under forty thousand, Technorati clearly has had good luck getting its results indexed and, most likely, well ranked in the major search engines.

It seems strange to me that Technorati and Icerocket both would allow other search engines to index their results pages, along with the content of the bloggers that feed them, without providing at least some reward to those that created the bulk of the content.

It seems that Sphere may have devised the fairest solution of all, preventing all indexing of the results page itself. This not only prevents spammers from taking advantage of the search engine, but also prevents Sphere from benefiting in a one-way relationship with blogger content.

Personally, I have no problem with Technorati's or Icerocket's use of my content. They both have sent a fair amount of traffic my way and have helped my site grow. If I did have such a problem, I would certainly opt out immediately.

While elements of their nofollow policies make me nervous, I realize that, most likely, they are attempting to reduce the amount of trash on the Internet. That is one thing that could work out well for everyone.
