Why Scrape?

Search engine spammers who scrape content from RSS feeds aren’t the elite of the field. They aren’t evil geniuses gaming the system, or even decent black hat SEOs; they are simply lazy and cheap.

Simply put, scraped content comes with too many problems to be effective. It’s unreliable, often of poor quality (especially for keyword density purposes), can land a spammer in copyright trouble and may subject the spammer to duplicate content penalties, severely hurting his rankings. It can also lead to embarrassing situations and horrible mistakes.

With so many other means of obtaining content available, only a fool would scrape RSS feeds. However, many still do, though there is some evidence that the trend is changing.

Generation Generate

There’s something of a consensus among black hat SEOs that content generation, using computer programs to automatically produce content, is the way to go.

Content generation, for spam purposes, is faster (limited only by the speed of the computer), more customizable and, with the right tools, undetectable by the search engines. It also avoids sticky copyright issues that can get AdSense accounts cut and prompt hosts to shut down sites.

Best of all, content generation allows the user to completely control the work, adjusting keyword density, sentence length, paragraph length, overall length and more.

These are some of the reasons that the now-famous “bad data push” incident, where a search engine spammer got millions, possibly billions, of pages indexed on junk domains, was done without using a single word of scraped content. Instead, it used auto-generated pages and traditional black hat tactics to fool the search engines.

Despite the success of content generation, many continue to scrape content and, along with it, infringe the copyrights of legitimate bloggers.

All of this raises the question: why is anyone scraping at all?

So Why Scrape?

The reason is simple. Scraping is the easiest, least technically advanced way to obtain content. The content is widely available and the applications to scrape it are easy to find and very cheap. Generating content, though much better in the long run, requires a higher up-front investment of work, money and knowledge.

However, as with most things dealing with technology, generating content is getting easier. One site, EssayGenerator.com, enables users to generate an “essay” for free by simply entering a subject. While these essays aren’t of very high quality to human readers (though one allegedly tricked an educator), they certainly might be appetizing to the search engines.

Of course, EssayGenerator.com essays wouldn’t work for search engine spam, not unless the results could be scraped. It would be too slow and, odds are, the search engines can already detect its output as computer generated. Still, it shows that the cost and difficulty of generating content are going down fast. Other tools that surpass EssayGenerator.com are rapidly dropping in price and improving in usability and effectiveness.

Soon, it will be cheaper and easier to generate than to scrape. Perhaps then scraping will start to lose favor but, sadly, that doesn’t seem likely.

Despite the obvious shortcomings of scraping, including ethical, legal and effectiveness issues, scraping has its supporters, including people financially invested in it, and likely won’t fade any time soon.

Even if scrapers make up a smaller and smaller share of the black hat SEOs in the world, the growth of the Internet promises a steady stream of new scrapers, even as old ones grow wise, get shut down or otherwise move on.

Conclusions

The reason this is important is simple. Many people, when their content is scraped, become intimidated. They do nothing, not because they feel the law is against them, but because they feel hopelessly outclassed, as if there were no way to stop the scrapers.

That is far from the truth.

Scrapers are thought of as the lowest rung on the black hat SEO ladder, and you don’t have to be a geek or a legal expert to protect yourself from a scraper, especially since the scraper is rarely either.

There’s no reason not to fight back, especially with so many tools available for doing so. The more people who fight back, the faster scrapers will wise up and move on.

While we may not be able to stop search engine spam (with so much easy money to be made, it’s just too tempting a target), we might be able to protect our hard work.

It’s a small victory, but not one to be underestimated.

 


13 COMMENTS

  1. It seems to me that Google is going to have to revisit the AdSense game. With so many millions of pages of generated and scraped content existing solely for the purpose of making money off of waylaid Internet searchers, advertisers are going to get fed up with ever-diminishing returns.

    Sooner or later someone is going to come up with an Advertising product / service that only accepts quality sites of real utility.

  2. The problem really isn't setting up a new network that caters only to the best of sites. Many networks are doing that right now. The problem is getting enough of both the advertisers and the publishers to reach critical mass. Google was able to do that easily due to their background, others have struggled.

    I fooled around with Adwords some recently and was AMAZED at how much people pay for clicks through Google. It's insane at times.

    The price on this is going to have to come down, especially with all of the fraud, it's too easy to make a free living off of spamming.

  3. While you really can't control what Google or scraper sites do, you can format your content to provide links back to your site. The quality will be low, but if a prospect stumbles across your content, they may come back to your site on the next visit. Thanks for the ideas, anything that keeps revenue in my pocket is appreciated.

  4. I agree that you can't control it but you can do a great deal to mitigate against it by including digital fingerprints to track it and using the inbound links to point to your own site. Glad you liked the article!

  5. Well, the view of the passage is totally correct, your louis vuitton handbags details is really reasonable and you guy give us valuable informative post, I totally agree the standpoint of upstairs. I often surfing on this forum when I m free and I find there are so much good information we can learn in this forum!

    • Sadly, no. The nature of RSS prevents that. You can truncate your feed though that can be defeated too. Your best bet is usually to put in a digital fingerprint to monitor your feed and stop the uses that are impacting your site.
