The Changing Face of Search Engine Spam

Jonathan Bailey, June 20, 2006

4 minutes read

Search engine spam is in the news like never before. A single spam site, originally discovered by the Digital Point user Nintendo, was estimated by Google to have some five billion Web pages indexed in the search engine. As one Google employee, Adam Lasnik, pointed out, the actual number is very likely off by a great deal (due to what he calls a "bad data push"). Still, there is evidence that the spam was very successful, including a ranking of over 2000 in Alexa and spam pages ranking highly in Google's results.

Reaction to the announcement has been mixed. Some are angry with Google for allowing this to happen, others are amazed at the operator's skill in gaming the system so well, and still others, such as Ana, remain almost completely unimpressed.

Copyright holders, however, need to take note of what happened. The "Bad Data Push" incident, as it's being called, might be indicative of the changing face of search engine spam.

If that's the case, it would be bad news for the search engines, especially Google, but potentially very good news for Webmasters worried that their content might be scraped for someone else's gain.

Scraping is Becoming Obsolete

Scrapers have generally been despised by both sides of the search engine optimization (SEO) wars. White hats, who focus on improving rankings for legitimate content, view scraping as copyright infringement and theft of their hard work. Black hats, who focus on tricking search engines into ranking their sites higher than they would be otherwise, view scraping as "one of the worst ways to get content" and something to do when "there is no alternative."

Even as black hats sell software that enables users to scrape content, they belittle the act and talk more about other means of filling Web pages with junk content. Articles about scrapers are surprisingly hard to find on most black hat sites and most tools that have scraping capability bury the feature.

In short, though many tools out there can and do scrape, very few have it as their primary function.

The reason is that, ethical and legal issues aside, scraped content comes with many problems. First, scraped content is, by default, duplicate content found elsewhere and that may hamper search engine rankings. It's also unreliable since blogs and sites shut down regularly, not always keyword rich, not high in quantity, hard to locate (especially for millions of sites) and can produce a very angry backlash that ends the gravy train prematurely.
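To make the duplicate content problem concrete, here is a minimal sketch of one common way to measure how much two pieces of text overlap: break each into overlapping three-word "shingles" and compare the sets. This is only an illustration under stated assumptions. The function names, the shingle size and the sample text are all hypothetical, and nothing here is meant to describe how Google or any other search engine actually detects duplicates.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Text shorter than k words: treat the whole thing as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(text_a, text_b, k=3):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical sample text, for illustration only.
original = "the quick brown fox jumps over the lazy dog"
scraped = original + " click here for more"
unrelated = "search engines rank pages by many different signals"

print(similarity(original, scraped))    # high: the copy contains the original
print(similarity(original, unrelated))  # no shared three-word runs
```

A scraped copy with a little extra text bolted on still scores high against its source, while unrelated text scores zero. That is the sense in which scraped content is duplicate "by default."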

In short, scraped content is inadequate for people who are serious about gaming the search engines. Automatically generated content is faster, easier, more reliable and less trouble than scraped content.

For those who insist on scraping, there are large article collections (both free and low cost) available for use and tomes of public domain material, all of which come without most of the hassle of scraping unwilling content. These sources, especially when combined with standard synonymizing software, can produce more content, with greater reliability and much less headache than scraping alone ever could.

This is why, in my look ahead to 2006, I said that scraping, most likely, was going to lose favor with the sploggers and other spammers. From the looks of things, I was at least somewhat right.

Countless Pages, None of them Yours

By most accounts, the spam pages in the Bad Data Push incident were not scraped from any traditional, legitimate Website. Most agree that the content was either auto-generated or, at worst, scraped from search engine results.

Judging from the sheer amount of content thrown up and the time frame (less than a month), scraping all of the content would have been all but impossible. It simply would have taken too much time to locate and scrape the articles, let alone get them indexed into Google and have them start generating traffic.

Odds are, as most have speculated, the content was auto-generated with portions scraped from search engine results, which do contain small elements of others' content, along with a healthy mix of precious ads. The real magic seems to lie in the technology within the server itself, which was able to imitate millions of subdomains, generate content on the fly and fool Google into thinking it was something else.

This simply would not have been possible with traditional RSS or Web site scraping.

There is little doubt that, as black hats all over the world catch wind of the success this one spammer had, many will shy away from scraping in favor of other forms of spamming. This could, theoretically, create a nightmare for the major search engines but does give Webmasters a reason to breathe a sigh of relief. After all, their content is less appetizing to a spammer than ever.

Of course, how one will fare in the search results once these new methods become more widespread is a completely different issue.

Is Scraping Dead?

One would hope, upon realizing this, that scraping would die and Webmasters would have a major copyright burden off of their backs. Sadly, that's not likely to be the case.

Black hat SEOs pride themselves on trying new things and taking approaches others don't. Thus, some will always resort to scraping. Even if it loses favor due to proven flaws in the system, it will always be used by some, especially those without much experience or money. Truth be told, scraping hasn't had any real favor with black hat gurus for some time and yet, as any check of splog search engines will show, vast amounts of scraped content are still being used.

The sad outcome of this could be that we notice almost no drop-off in actual scraping as we find ourselves dealing with more black hats than before. Though a smaller percentage will likely use scraping, it's likely that the number of actual scrapers will stay roughly the same.

Nonetheless, this recent incident does prove that, while scraping can generate some SEO benefit, greater results can be gleaned from other methods. While all of these techniques are equally unethical from an SEO standpoint and may present problems to Webmasters trying to get their legitimate sites listed high in the search results, at least the fear of having our content stolen can be somewhat relaxed.

Of course, we are just waiting for the next big black hat breakthrough to change the game again. There truly is no telling what tomorrow brings.


