Why Blog Searching Fails

Blog searching has been something of the holy grail for the new Web. Countless companies and search engines have worked to tap the endless stream of information that is the blogging world to deliver useful results in near-real time.

Bloggers, in turn, have used these tools to keep on top of various topics, check for misuse of their content, seek out related posts and look for other sites referencing them.

Many even use the tools without realizing it. WordPress users, for example, use blog search every time they load their dashboard and see their “incoming links”. Others use it when they embed Technorati Tags into their posts.

However, blog searching is not working, and it is getting worse. Where it once returned respectable results, today it throws out far more noise than signal. That makes it nearly impossible to put blog search to any practical use. The only effective use left for the technology is searching for content theft and, even then, only in certain situations.

Unless something is done to fix blog search, it is only a matter of time before the technology is left completely by the wayside.

My History

When I first started Plagiarism Today nearly three years ago, I began using blog search extensively. I subscribed to a series of related Technorati watchlists and used my RSS reader to keep track of the latest happenings in the field of copyright and content theft. As time went on, I added RSS feeds from other search engines, including Icerocket and Google Blog Search.

Initially the system worked pretty well. Though there were some garbage results, for the most part every item in my RSS reader warranted opening in a browser. However, over time, the amount of noise grew at a much faster rate than the number of legitimate posts, eventually outpacing it three or four times over.

Now, I feel inundated by these watchlists. I get dozens of results per hour, but only a fraction of them are actually original, human-written articles.

As a result, I spend as much time per day filtering out the junk from my feeds as I do responding to and bookmarking legitimate articles.

The system has become the model of inefficiency and is borderline useless. However, looking through my feeds, I think I know where much of the noise is coming from and why this problem has crept up on us.

Issues With Blog Search

When performing an autopsy on my blog search results recently, I noticed that there were six types of posts that were cluttering up my feeds and suffocating the original content.

  1. Spam Blogs: Splogs are an easy target to blame for blog search clutter but they are not the worst players, truth be told. Since most simply parrot posts already available elsewhere in the feed, they are easy to ignore. Still, the nature of spam blogs pretty much ensures that every post I want to read is repeated two or three more times over the coming hours and days.
  2. Non-Blogs: When blog searching started, pretty much only blogs had RSS feeds. Today, RSS feeds are included in forums, Twitter pages, social networking profiles and more. Sadly, all of these things routinely show up in blog search results. While often relevant in terms of keywords, they are difficult to comment on and rarely provide any news.
  3. Comment Feeds: It seems that none of the blog search engines have been good at recognizing the difference between the main feed and the comment feed. I routinely see “Comment On” posts in my RSS reader. Typically these are either posts I’ve already commented on or very old posts that are simply receiving a trickle of comments or trackbacks.
  4. Old Posts: It is a strange but increasingly common issue where the results that get returned are anything but timely. I regularly get posts marked as “new” that are really several weeks old. This morning, for example, one of my blog search feeds had two posts talking about the candidates “fighting for Texas” and reporting the allegations of plagiarism against Obama as new. The posts were dated in late February.
  5. Repeated Posts: I have RSS feeds from three of the major blog search engines, so I fully expect to see the same post a few times on the different feeds. However, I regularly see the same post repeated on the same feed, sometimes even days or weeks later. Many times I click on a post, thinking it is new, only to find that I commented on it weeks ago; a quick check then shows that it appeared on the same feed when it was brand new.
  6. Foreign Language Posts: Finally, even though my search terms are in English, I regularly get results where the bulk of the post is in another language. Though these blogs are often merely spam blogs using automatic translation, they are still of little use to me as I can’t read them, at least not without some translation, which none of the blog search engines seem to provide.

This is not to say that all of the search engines suffer from the problems above equally. Google Blog Search, for example, is inundated with spam blogs and non-blog results; Technorati is the worst about repeating posts and delivering content too late; and Icerocket seems to return a large number of foreign language posts.

Of course, these aren’t the only things that have been limiting the usefulness of my blog search tools, but they are six of the biggest players to be certain. If I were going to start fixing blog search, these are the places I would start.
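
To make a few of these problems concrete, here is a minimal sketch in Python of the kind of filtering a reader, or a search engine, could apply to incoming results. It is not how any of the engines mentioned here actually work. The FeedItem record, its field names, the “comment on” title check and the seven-day cutoff are all assumptions made for illustration. It only addresses non-blogs, comment feeds, old posts and repeated posts; spotting spam blogs and foreign language posts would need content comparison and language detection, which are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FeedItem:
    # Hypothetical record for one search result; the field names are assumptions.
    title: str
    url: str
    published: datetime
    source_is_blog: bool = True  # assumed to be known from the result's source


def normalize_url(url: str) -> str:
    """Reduce a URL to a rough key so repeats of the same post compare equal.

    This is deliberately crude: it lowercases the URL and drops the scheme,
    a leading "www.", the query string, the fragment and any trailing slash.
    """
    url = url.lower().split("#")[0].split("?")[0]
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
    if url.startswith("www."):
        url = url[4:]
    return url.rstrip("/")


def filter_items(items, now, max_age_days=7, seen=None):
    """Return the items that survive four simple checks.

    `seen` is a set of normalized URLs; pass the same set on every run
    to catch posts that reappear days or weeks later.
    """
    seen = set() if seen is None else seen
    kept = []
    for item in items:
        if not item.source_is_blog:
            continue  # non-blogs: forums, profiles and so on
        if item.title.lower().startswith("comment on"):
            continue  # comment feeds (assumed title convention)
        if now - item.published > timedelta(days=max_age_days):
            continue  # old posts presented as new
        key = normalize_url(item.url)
        if key in seen:
            continue  # repeated posts
        seen.add(key)
        kept.append(item)
    return kept
```

Keeping the set of seen URLs between runs is what would catch the post that resurfaces weeks after it first appeared. The age cutoff is a judgment call: too short and slow-to-index posts are lost, too long and stale results slip through.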

Stinky Swiss Cheese

Of course, the problem with blog search isn’t just that its results are cluttered. I’d be willing to deal with a large amount of clutter in order to stay on top of the relevant news. Unfortunately, I regularly miss stories, including those that the feed should have picked up.

I can offer no explanation for why this happens. Many of the sites are available if you perform a manual search, so there should be no reason they are missed.

Fortunately, since I do use multiple search engines, the problem is not as severe as it would be if I used only one. But even with all of these layers, some sites do not show up and the problem baffles me.

Of the three I use regularly, Icerocket seems to be the worst at picking up all of the articles I need, even regularly missing updates from Plagiarism Today. Most of the time, though, the problem seems somewhat random and hard to pin on one or two engines.

Effects on Content Theft

When it comes to detecting content theft, many of the most popular tools, including Copyfeed and the Digital Fingerprint Plugin, use blog search engines to detect misuse of the RSS feed.

In that regard, the fact that spam blogs and duplicate posts routinely show up in blog search engines is a good thing. It means that the detection is more reliable and that the odds of the bad guys showing up are good.

Unfortunately, the fact that they can reliably appear in the results also encourages spam blogging as an activity. Though blog search engines are not the ultimate target of sploggers, the fact that junk content routinely gets indexed in them does nothing to discourage spammers. This means that, while we will see a large percentage of the scrapers, there will be many more of them.

Furthermore, with the other issues raised with blog search engines, it appears that we’ll be stumbling through the clutter far more than we’ll be battling the bad guys. This could, in the long run, actually hinder our ability to locate and deal with spam bloggers, not only by throwing up excess noise but also by making it impossible to target the ones that pose the greatest threat.

Conclusions

What is striking is how different the blog search results are compared to the traditional search engines. Though Google and Yahoo! may take a few more days to pick up their results, they are almost completely free of spam blogs and seem, overall, to present relevant content in an organized manner.

Where Google Blog Search is a dismal failure, the main Google search engine is a triumph. This shows not just the different challenges each search engine faces, but a lack of focus and effort on creating the best blog search results possible.

It is clear that blog search has not been given the attention it needs to thrive. The companies involved, even those who have blog search as their sole business, have decided that it is not worth dedicating the resources toward fixing these issues.

Sadly, these are not new problems that only revealed themselves recently. They are ongoing issues that have simply grown larger and larger, to the point that they now drag the entire system down. Where once they were minor nuisances begging to be nipped in the bud, now they are huge problems that feel almost too big to tackle.

It is only a matter of time before blog search becomes completely useless, if it isn’t there already, and it is a problem that many of us have seen coming for a very long time.
