Why Scrape?

Search engine spammers who scrape content from RSS feeds aren’t the elite of the field. They aren’t evil geniuses gaming the system, or even decent black hat SEOs; they are simply lazy and cheap.

Simply put, scraped content comes with too many problems to be effective. It’s unreliable, often of poor quality (especially for keyword density purposes), can land a spammer in copyright trouble and may subject the spammer to duplicate content penalties, severely hurting his rankings. It can also lead to embarrassing situations and horrible mistakes.

With so many other means of obtaining content available, only a fool would scrape RSS feeds. However, many still do, though there is some evidence that the trend is changing.

Generation Generate

There’s something of a consensus among black hat SEOs that content generation, using computer programs to automatically produce content, is the way to go.

Content generation, for spam purposes, is faster (limited only by the speed of the computer), more customizable and, with the right tools, undetectable by the search engines. It also avoids sticky copyright issues that can get AdSense accounts cut and prompt hosts to shut down sites.

Best of all, content generation allows the user to completely control the work, adjusting keyword density, sentence length, paragraph length, overall length and more.

These are some of the reasons that the now-famous “bad data push” incident, where a search engine spammer got millions, possibly billions, of pages indexed on junk domains, was done without using a single word of scraped content. Instead, it used auto-generated pages and traditional black hat tactics to fool the search engines.

Despite the success of content generation, many continue to scrape content and, along with it, infringe the copyrights of legitimate bloggers.

All of this raises the question: why is anyone scraping at all?

So Why Scrape?

The reason is simple. Scraping is the easiest, least technically advanced way to obtain content. The content is widely available and the applications to scrape it are easy to find and very cheap. Generating content, though much better in the long run, requires a higher up-front investment of work, money and knowledge.

However, as with most things dealing with technology, generating content is getting easier. One site, EssayGenerator.com, enables users to generate an “essay” for free by simply entering a subject. While these essays aren’t of very high quality to human readers (though one allegedly tricked an educator), they certainly might be appetizing to the search engines.

Of course, EssayGenerator.com essays wouldn’t work for search engine spam, not unless the results could be scraped. It would be too slow and, odds are, the search engines can already detect its output as computer generated. Still, it shows that the cost and difficulty of generating content are going down fast. Other tools that surpass EssayGenerator.com are rapidly dropping in price and improving in usability and effectiveness.

Soon, it will be cheaper and easier to generate than to scrape. Perhaps then scraping will start to lose favor but, sadly, that doesn’t seem likely.

Despite the obvious shortcomings of scraping, including ethical, legal and effectiveness issues, scraping has its supporters, including people financially invested in it, and likely won’t fade any time soon.

Even if scrapers make up a smaller and smaller share of the black hat SEOs in the world, the growth of the Internet promises a steady stream of new scrapers, even as old ones grow wise, get shut down or otherwise move on.

Conclusions

The reason this is important is simple. Many people, when their content is scraped, become intimidated. They do nothing, not because they feel the law is against them, but because they feel hopelessly outclassed, as if there were no way to stop the scrapers.

That is far from the truth.

Scrapers are thought of as the lowest rung on the black hat SEO ladder, and you don’t have to be a geek or a legal expert to protect yourself from a scraper, especially since the scraper is rarely either.

There’s no reason not to fight back, especially with so many tools available for doing so. The more people who fight back, the faster scrapers will wise up and move on.

While we may not be able to stop search engine spam (with so much easy money to be made, it’s just too tempting a target), we might be able to protect our hard work.

It’s a small victory, but not one to be underestimated.

 


13 COMMENTS

  1. It seems to me that Google is going to have to revisit the AdSense game. With so many millions of pages of generated and scraped content existing solely for the purpose of making money off of waylaid Internet searchers, advertisers are going to get fed up with ever-diminishing returns.

    Sooner or later someone is going to come up with an Advertising product / service that only accepts quality sites of real utility.

  2. The problem really isn't setting up a new network that caters only to the best of sites. Many networks are doing that right now. The problem is getting enough of both the advertisers and the publishers to reach critical mass. Google was able to do that easily due to their background, others have struggled.

    I fooled around with Adwords some recently and was AMAZED at how much people pay for clicks through Google. It's insane at times.

    The price on this is going to have to come down, especially with all of the fraud, it's too easy to make a free living off of spamming.

  3. While you really can't control what Google or scraper sites do, you can format your content to provide links back to your site. The quality will be low, but if a prospect stumbles across your content, they may come back to your site on the next visit. Thanks for the ideas, anything that keeps revenue in my pocket is appreciated.

  4. I agree that you can't control it but you can do a great deal to mitigate against it by including digital fingerprints to track it and using the inbound links to point to your own site. Glad you liked the article!

  5. Well, the view of the passage is totally correct, your louis vuitton handbags details is really reasonable and you guy give us valuable informative post, I totally agree the standpoint of upstairs. I often surfing on this forum when I m free and I find there are so much good information we can learn in this forum!

    • Sadly, no. The nature of RSS prevents that. You can truncate your feed though that can be defeated too. Your best bet is usually to put in a digital fingerprint to monitor your feed and stop the uses that are impacting your site.
