Researchers at Microsoft have developed a new technique to battle search engine spammers including sploggers and comment spammers.
The new system, which effectively turns the search engines against the spammer, is promised to be a huge step forward in the war between spammers and search engines.
Despite this, questions linger about how effective the system will be, and its actual impact on the splogger battle remains unclear.
A Brief Overview
The new system, called "Strider Search Defender", is designed to move beyond the old ways search engines would detect and filter out spam.
Previously, search engines would detect spam based almost solely on the content of the page. This was part of the reason many spammers turned to scraping the content from legitimate sites as it was almost guaranteed to pass through those filters. That, in turn, has caused many of the copyright and plagiarism headaches that Webmasters have been dealing with in recent months and years.
With Strider Search Defender, Microsoft takes a group of known spam URLs as a "seed" and then tracks the sites that link to them. It then checks those sites for other spam links and repeats the process with them. Eventually, this enables Microsoft to spot forums and blogs that are frequently being spammed and turn them into "honeyforums" that it can use to detect future spam operations.
According to Microsoft, it will then check the list of possible spam links and weed out false positives using another tool called "Strider URL Tracer", which will check the outgoing links on the suspected spam site for spam-like behavior, including doorway pages and non-legitimate ad syndicators.
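To make the process above a little more concrete, here is a rough sketch in Python. This is my own illustration and not Microsoft's code: the function name, the two link dictionaries and the looks_like_spam check are all hypothetical stand-ins. In particular, looks_like_spam only marks the spot where a verification step such as Strider URL Tracer would sit, and I am assuming the link data has already been collected somehow.

```python
def expand_spam_candidates(seed_urls, backlinks, outlinks, looks_like_spam, max_rounds=3):
    """Grow a seed list of spam URLs by following the pages that link to them.

    seed_urls:       iterable of URLs already known to be spam
    backlinks:       dict mapping a URL to the set of pages that link to it
    outlinks:        dict mapping a page to the set of URLs it links to
    looks_like_spam: callable(url) -> bool, a placeholder for a real
                     verification step (hypothetical, supplied by the caller)
    """
    confirmed = set(seed_urls)
    frontier = set(seed_urls)

    for _ in range(max_rounds):
        # Pages (forum threads, blog comments) that link to the newest spam URLs.
        hosts = set()
        for url in frontier:
            hosts |= backlinks.get(url, set())

        # Every other link on those pages becomes a candidate.
        candidates = set()
        for page in hosts:
            candidates |= outlinks.get(page, set())
        candidates -= confirmed

        # Keep only candidates that pass verification; this is where
        # legitimate links dropped in by spammers should be filtered out.
        new_spam = {url for url in candidates if looks_like_spam(url)}
        if not new_spam:
            break
        confirmed |= new_spam
        frontier = new_spam

    # Count how many confirmed spam URLs each page links to. Pages with
    # high counts are the ones that could be watched as "honeyforums".
    spam_hits = {}
    for page, links in outlinks.items():
        hits = len(links & confirmed)
        if hits:
            spam_hits[page] = hits

    return confirmed, spam_hits
```

The function returns the grown set of spam URLs along with a count, per page, of how many of those URLs the page links to. How well any of this works depends almost entirely on the verification step, which is exactly where the false positive concerns discussed below come in.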
The end result, theoretically, will be the nuking of not just an individual spam site, but an entire spam network, making the removal of spam sites a nearly automatic process and greatly speeding up the detection of new spam networks.
Criticisms and Concerns
While there's little doubt that the system is a huge leap forward for search engines and will greatly help them combat the most common black hat tactics, there are some concerns and problems that come with the system.
First, while Microsoft is working to reduce the problem of false positives, it's easy to see how they could occur. If a forum or a blog gets defined as being a spammer haven, any legitimate user that posts a comment or reply to it will have his or her site run through the system. If its safeguards fail to recognize the site as legitimate, the site will be pegged as spam and treated accordingly.
Worse still, as Microsoft admits, spammers often throw legitimate links in with their spam links when posting forum, comment and blog spam. This means that simply staying away from sites that are frequently spammed is not enough. A legitimate site can be put into the "possible spam" pile simply by being an unwitting victim of a spammer's bid to appear more legitimate.
Also, at some point, the system will still likely rely on traditional spam detection techniques. As spammers set up new networks and find new methods and sites to push their junk links on, they might find themselves out of reach of the system. The system will have to evolve and, most likely, re-seed from time to time in order to keep up with the rapid shifts in the area.
Finally, as with any new advancement in the war against spam, there is the notion that it is nothing more than an escalation in the game of cat and mouse. Simply put, as the system begins to be rolled out (Microsoft is still working on automating the whole process), there is an understanding that spammers will immediately begin working on ways to defeat it and will almost certainly succeed, at least in some capacity.
Still, there is little doubt that this is a major leap forward in the bid for search engines to purge their rankings. I am particularly impressed with this quote from Microsoft lead researcher Yi-Min Wang: "In the end it is all about protecting the search engines. Because if the spam doesn't show up in any search engine result the spammer will not receive traffic."
If nothing else, Microsoft seems to understand the problem and is actively working toward fixing it.
Impact on Copyright Holders
The hope is that the Strider Search Defender will make Web spamming more difficult and less profitable. That, theoretically, would mean fewer spammers, which in turn would mean less scraping. This could be a huge win for creators of legitimate content, if the system works correctly and doesn't ensnare original and useful content.
However, that remains completely unproven and we honestly won't know what will happen until the system goes live. Furthermore, with many Black Hats already frowning on scraping due to potential search engine penalties and copyright issues, scraping is likely tapering off with only the newest and least-skilled spammers turning to it.
In the end, it might be difficult for copyright holders to tell if scraping is dying off because of changes to the search engines, or changes in the way black hats generate the content they need. Either way, it looks likely that scraping will start to taper off in the near future.
Of course, the one thing that has been consistent with this battle has been rapid change. In that regard, this Microsoft announcement is no exception.
The original Microsoft paper is a great read on spamming and splogging. It discusses the issue in depth, in very plain language, and reaches many of the same conclusions others and I have reached before. It even has a great diagram showing how search engine spamming works, including how splogging and comment spamming go hand-in-hand.
However, their research also found some very interesting facts that I was not aware of before reading the article.
- Of the 17,000 blogspot.com spam blogs they looked at, nearly half were doorways for one of the top 25 target-page domains.
- Over 95% of the spam blogs analyzed at blog4ever were from the same person.
- The paper links to a partial list of all blog spam commercial software applications. Though incomplete, it lists nine different apps ranging in price from about $70 to $300.
- Though the paper talks at length about blogspot.com spam blogs and does similar comparisons on other domains, it makes no mention of the splogging problem on Yahoo 360 or Microsoft's own MSN Spaces service. This is especially odd since most splogs I see in my Technorati watchlists are on one of the two services.
While I personally have little love for Microsoft, there is little doubt that this is an interesting paper with some very interesting facts and a great deal of hope.
It will be very exciting to see how things turn out.