Search Engines: Three to Beware

Jonathan Bailey, April 17, 2008


Ever since Google and Yahoo! set the gold standard for how search engines use other people’s content, there have been other sites and services that have sought to push the envelope.

But following the controversy over the RSS aggregator Shyftr, it is worthwhile to take a look at three other sites that are making widespread use of blog content, how they are doing so and what the potential implications are.

After all, Shyftr is not the only site republishing your feed in full or in large part; it is only one of several, some of which may raise even more red flags than Shyftr itself did.

BlogDimension

Background: BlogDimension calls itself “The Web 2.0 search engine”. It is fundamentally a blog search engine that functions similarly to Technorati or Icerocket. Users search for terms and are delivered a list of relevant blog entries. Users can also search for audio and video.

Why it is Controversial: In addition to offering standard search results, the site offers a “cached” link that displays the full feed as it is scraped from the RSS entry. The scraped entries are available to the search engines: there is no robots.txt, the cached link is not nofollowed and the page itself is not marked “noindex”. The entries are also served with many different ads.

To make matters worse, the cached pages do not follow the standard of a clickable title, instead offering only a very small attribution in the footer, and they modify the original articles by stripping out images/ads and other formatting.

All in all, the site manages to avoid doing nearly all of the things that made Google Cache a fair use and does not offer a clear opt-out system or a way to prevent caching.
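For bloggers who want to check these safeguards themselves, here is a minimal Python sketch, not a description of how Blogdimension or any search engine actually works. It reads the robots meta tag from a page and tests a URL against a robots.txt file. The function names and the example.com address are hypothetical, and “noarchive” is included only as an assumed example of a directive that major search engines commonly honor for cached copies.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


def page_directives(html):
    """Return the set of robots directives declared in a page's HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def crawl_allowed(robots_txt, url, agent="*"):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(agent, url)


if __name__ == "__main__":
    # Hypothetical cached page that declares no robots directives at all.
    cached_page = "<html><head><title>Cached entry</title></head></html>"
    print("noindex" in page_directives(cached_page))
    # Hypothetical robots.txt that keeps crawlers out of a /cache/ path.
    robots = "User-agent: *\nDisallow: /cache/"
    print(crawl_allowed(robots, "http://example.com/cache/123"))
```

A cached page that returns no “noindex” directive, on a site whose robots.txt does not block the cache path, is open to indexing by other search engines, which is the situation described above.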

Blogdimension’s Response: Blogdimension addressed many of these issues in a recent blog entry detailing a conflict they had with another blogger.

They start off by saying that “The story is very simple. There is a new trend from some bloggers to treat search engines as thieves.” This seems unduly callous considering that it is their search engine that violates standard industry practices and goes against recent legal rulings.

However, they do go on to say that they are working to address the bug that causes “nocache” to not work and are also considering caching only short portions of the original entries. Considering that there are currently no standards for designating “nocache” in RSS feeds, especially in cases where the feed is hosted off-site, that may be the best approach.

My Opinion: Blogdimension pushes the envelope significantly both legally and ethically. It crosses lines that a search engine should not cross and goes beyond being just a tool for locating information and, instead, is also trying to profit from the work of bloggers by displaying ads next to full content.

With a few modifications, Blogdimension could be a decent search engine. But as a site that is hosted in the EU, namely France, it is under the gun of even more stringent notice and takedown provisions than here in the U.S.

Fixing the service to both be a good neighbor and fit more comfortably with industry practices should be simple, but it is of critical importance.


URLFan.com

Background: URLFan describes itself as “an evolving experiment designed to discover what websites the blogosphere is discussing all in real time.” It is fundamentally a blog search engine that allows users to put in a domain and find what other sites are saying about it. It also has a more traditional blog search engine, this one powered by Google.

Why it is Controversial: URLFan displays the full content of the feeds it parses in at least two different places.

First, when performing a search for a URL, hovering over the “More” link below a search result will result in the site displaying the full entry, complete with images. It is unclear whether or not this content is indexed in the search engines.

However, in the second location, reached by clicking the “Analysis” link below the title of the post, one receives a full copy of the entry on a page that offers a brief breakdown of the material in the post. This content is definitely indexed in the search engines, a point proved by URLFan’s own Google-powered search engine.

Also, in both locations, the images are hotlinked off the original server and the content is surrounded by URLFan’s ads.
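To make the hotlinking point concrete, the following is a rough Python sketch that lists the images on a page that are loaded from a different host than the page itself. It is only an illustration under stated assumptions: the function name, the class name and the example hosts are hypothetical, and it says nothing about how URLFan builds its pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def hotlinked_images(page_url, html):
    """Return absolute URLs of images served from a host other than the page's."""
    page_host = urlparse(page_url).hostname
    collector = ImageCollector()
    collector.feed(html)
    found = []
    for src in collector.sources:
        # Resolve relative paths against the page URL before comparing hosts.
        absolute = urljoin(page_url, src)
        if urlparse(absolute).hostname != page_host:
            found.append(absolute)
    return found


if __name__ == "__main__":
    # Hypothetical republished entry that pulls one image from the original blog.
    html = '<img src="http://blog.example/photo.png"><img src="/logo.png">'
    for url in hotlinked_images("http://aggregator.example/analysis/1", html):
        print(url)
```

Every image this reports is served by the original blogger's server, which means the blogger pays for the bandwidth while the republishing site shows the ads.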

Much like with Blogdimension, the site pushes boundaries legally and avoids many of the safeguards that helped Google win their case.

URLFan’s Response: I sent a query to URLFan over 24 hours ago and have not received a reply. I will update this post should they respond.

My Opinion: URLFan seems to be very reckless with the way it displays blogger content. Where Blogdimension is trying to create a caching service similar to Google’s, URLFan seems to be finding all the ways it can to include the full content from your feed next to their ads.

It is unclear what the actual purpose of this site would be though, other than performing occasional vanity searches. The site itself seems to have several bugs (I seem to be my only referrer for the most part) and its best feature is powered by Google itself.

This brings URLFan dangerously close to “spam blog” territory. This site would need some major changes before I would ever be comfortable using or recommending it.

Today.com

Background: Today.com is a blog search engine and directory that also provides free blog hosting.

Why it is Controversial: Today.com is very careful to only display partial content on the site itself. It includes snippets in both its results and in its “preview” pages.

The controversy, instead, comes from the fact that Today.com allows users to comment on posts, albeit the shortened version of them. With the recent controversies over fractured conversations, it may raise some ire.

However, what is likely to get more attention is that, when a user clicks out of the site, the site they go to is then displayed in a frame with Today.com information at the top of the page.

The latter practice, especially, runs counter to what most ethical sites do, and many bloggers remain very upset about sites that display them in a frame, even a small one such as this.

Today.com’s Response: I sent a query to Today.com over 24 hours ago and have not received a reply. I will update this post should they respond.

My Opinion: While I’m certainly not happy that Today.com feels compelled to frame all outbound links, they do not raise the same issues as the others on the list.

Likewise, the fragmented conversations problem doesn’t worry me greatly. Not only is it not an actual copyright issue, but I could not find a single post with a comment on it.

Today.com is not a site that I will use personally, but it does not raise as many alarm bells as the others either.

All of the issues with the site would be very easy to fix and would likely only require a few minor adjustments.

Conclusions

These days, almost anyone can make the next great Web service or killer app. All one needs is the knowledge, some time and a server.

However, in doing so, it is important to have an understanding of the legal, ethical and industry guidelines that apply to what you are doing. In many of these cases, one gets the impression that the developers did not run what they were doing by an attorney or allow someone else to offer input. In the case of Blogdimension, they seemed almost hostile to the input that they did get.

It is easy when you are thinking about what would be neat or a great feature to forget about the people whose content you are using. However, any new application or search engine needs to think about content creators as well as users.

Unfortunately, there is much more to creating an application than just writing and testing code. There are many other considerations to weigh and many developers don’t seem ready to deal with them all.

Hopefully, developers will get more savvy about these issues and avoid these kinds of simple mistakes in the future. Otherwise, we can expect many more flame ups like the one over Shyftr.