Blind RSS scraping is obsolete.
While I wasn’t shocked to find that a splogger was making illegal use of my feed, I was definitely stunned to find that this particular splogger wasn’t just blindly scraping my content. Rather, he was intelligently scraping only what served his purposes and tagging it, both to organize it and to make his scraped content more potent than ever.
It quickly became clear that we are entering a new age of splogging and content scraping, one where the sploggers themselves become members of Web 2.0.
The Problems with Blind Scraping
Blind RSS scraping may be an easy way to generate a large volume of ill-gotten content, but it’s far from perfect.
First off, bloggers (myself included) tend to wander off topic pretty regularly. While this can make the blogging experience more interesting, it’s very frustrating to sploggers looking to home in on keywords for the benefit of search engines.
Second, blogs, including good ones that have been around for some time, have a way of (gulp) dying off. This, of course, kills any benefit the scraper might have obtained from reusing the feed.
Finally, bloggers have caught on to the act of RSS scraping and, though many have taken a very ambivalent attitude towards it, most get quite upset when they see their entire feed reused for commercial gain without their permission.
That’s why, even though splogging, in all of its forms, was growing faster than ever, it needed to evolve. Article generators and microscrapers helped those who prefer to produce content automatically evolve, and now smart scrapers are helping the plagiarist sploggers progress as well.
The idea behind smart scraping is actually remarkably simple.
Rather than scrape one feed and hope for the best, the smart scraper looks at dozens of feeds, perhaps hundreds or even thousands, and takes only the articles that best suit its purpose.
The simplest way to do this is to scan all of the entries for the target keyword and keep only the entries that contain it. Other methods can include fuzzy searching, where posts in a specific field are taken, or using the blogger’s own category or tagging system to determine which posts to take.
For example, if your RSS feed were in the database of a smart scraper looking for the keyword “Low Interest Mortgages”, an entry about getting a loan for your new house might get scraped while none of your other entries are touched.
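To see which of your own posts such a filter would single out, the keyword scan described above can be sketched in a few lines of Python. This is only an illustration and not any actual splogger's tool: the function names, the title/body dictionary layout and the sample posts are all made up for the example, and the "every word of the phrase must appear" rule is just one simple way to define a match.

```python
import re

def matches_keyword(entry, keyword):
    """Return True if every word of the keyword phrase appears in the entry.

    `entry` is assumed to be a dict with 'title' and 'body' strings
    (a hypothetical layout chosen for this sketch).
    """
    text = f"{entry['title']} {entry['body']}".lower()
    words_in_entry = set(re.findall(r"[a-z0-9']+", text))
    return all(word in words_in_entry for word in keyword.lower().split())

def filter_entries(entries, keyword):
    """Keep only the entries that match the target keyword."""
    return [entry for entry in entries if matches_keyword(entry, keyword)]

# Made-up sample posts standing in for the entries of a feed.
posts = [
    {"title": "Shopping for low interest mortgages",
     "body": "Notes on getting a loan for a new house."},
    {"title": "Weekend hiking photos",
     "body": "Nothing to do with finance."},
]

selected = filter_entries(posts, "Low Interest Mortgages")
print([post["title"] for post in selected])
```

Running your own archive through a check like this, with the keywords you suspect are being targeted, gives a rough idea of which handful of posts is most likely to turn up on a splog.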
No matter the method though, what this means is that not every post you put out will be scraped, only those that the system determines to be of value. This helps the splogger avoid copyright troubles since these generated blogs can appear to be legitimately reusing content, especially since some even offer a link back, and bloggers are much less likely to discover just a handful of their posts being stolen.
This also helps the splogger create a semi-unique, keyword-dense blog that is ripe for the search engines. What’s more, by simply searching for different phrases, the splogger can actually create dozens of different blogs, all with the same potency.
Needless to say, this is a very powerful technique that can create a very large number of very potent splogs. Copyright holders, especially bloggers, have a lot to be worried about.
What Can Be Done
Fortunately, it appears that Feedburner’s new Uncommon Uses feature can track these sploggers pretty easily. Since they still rely upon RSS feeds in order to obtain their content, Feedburner and similar go-between services at least have the potential to detect such uses.
However, those of us who use Creative Commons Licenses will have to be more wary. While scraping an entire feed for commercial gain might violate the intent of the license, taking one or two articles, even if the process is automatic, most likely won’t.
In the end, it’s going to come down to bloggers being aware of the problem and looking out for one another. The real danger in this kind of splogging is that it spreads the theft around so that no one Webmaster loses too much. If we look out for one another and take these sites seriously, even if they don’t dramatically impact our own work, they can be shut down.
Nonetheless, the war is changing rapidly and we have to be ready. Sploggers are now searching, sorting and tagging with the best of the bloggers, creating a new threat both to copyright and to search engine results.