Modified Scraping on the Rise

It appears that Google’s push to handle duplicate content may be having an unintended side effect.

Even though a recent report by Attributor indicates that the search engine has done a terrible job of separating originals from copies, the spammers don’t seem to be taking any chances.

Spam bloggers are no longer content to simply scrape entries and republish them; they are now modifying them in a variety of ways. On The Blog Herald, another site I write for, editor Tony Hung wrote about a site that seemed to be either synonymizing or double-translating the content it scraped. On tenforty, blogger Deb wrote about a case where her story was translated into another language before being reposted on a spam Blogspot blog.

What started out as a rare phenomenon is now turning into a regular occurrence. Unfortunately, as the tactics of spammers change, so must the tactics of those who seek to protect their content, and that calls for a new look at how we protect our works on the Web.

Background

The technology behind modified scraping has been around for several years. I first wrote about it in December of 2005. Back then the problem was fairly rare and the concept was still somewhat new. However, it seems as if more and more spammers are catching on to the trick and, possibly, that new spam blog networks are cropping up to take advantage of the technique.

The idea is that posting verbatim copied works is dangerous. Not only are you likely to get caught and shut down, but Google and other search engines assign penalties to duplicate content, making it harder to get the search rankings needed to make spamming profitable.

Since editing content by hand takes too long and defeats the purpose of scraping, spammers have started creating ways of modifying, or “spinning,” content before reposting it in order to fool the search engines into thinking that the spam site is actually both legitimate and original.

To achieve that, they use one of several techniques including, but not limited to, the following:

  1. Synonymized Content: The most basic approach takes an article and swaps out occasional words for synonyms according to a built-in thesaurus. Such a system can create hundreds or thousands of articles from a single source by using different combinations of synonyms (a minimal sketch of this approach appears after this list).
  2. Translated Content: This approach runs the content through an automatic translation program similar to what you find on Babelfish. Though the translations are far from perfect and leave the work in a foreign language, the result is usually intelligible to humans and search engines alike. This will likely become a more popular technique as the Web gains more of an audience in non-English-speaking countries and those markets become more valuable to spammers.
  3. Double Translated Content: The same as translated content, but this kind translates the translation back into English. This produces a heavily modified and often unintelligible result that, in many cases, bears little resemblance to the original. This type of scraper is purely theoretical at this time but very likely does exist.
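To make the first technique concrete, here is a minimal sketch of how a synonym-swapping spinner might work. The word list, swap rate and function names are purely illustrative assumptions on my part, not a description of any actual spam tool.

```python
import random

# Hypothetical, tiny thesaurus for illustration; real spinning tools rely on far larger word lists.
SYNONYMS = {
    "article": ["post", "entry", "piece"],
    "writer": ["author", "blogger"],
    "popular": ["well-known", "famous"],
    "quickly": ["rapidly", "swiftly"],
}

def spin(text, swap_rate=0.5):
    """Swap occasional words for synonyms so the copy no longer matches the original verbatim."""
    out = []
    for word in text.split():
        core = word.strip(".,!?")        # the word without trailing punctuation
        trailing = word[len(core):]      # any punctuation that followed it
        if core.lower() in SYNONYMS and random.random() < swap_rate:
            out.append(random.choice(SYNONYMS[core.lower()]) + trailing)
        else:
            out.append(word)
    return " ".join(out)

print(spin("The popular writer published the article quickly."))
# One possible output: "The well-known author published the post rapidly."
```

Because each run picks a different combination of synonyms, a single source article can yield a large number of superficially distinct copies, which is exactly why verbatim matching fails against this kind of scraping.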

In all of the above cases, the outcome is the same: the scraped article bears little resemblance to the original, making it much more difficult to detect and stop, both for the victims and for the search engines.

Changing Strategies

The problem for bloggers when dealing with this type of plagiarism is that the typical methods of detection simply don’t work. Copyscape, though drastically improved, will struggle with this kind of scraping, as will every other plagiarism checker that works by looking for verbatim copying. This includes high-end academic solutions.

Even Google Alerts can be thwarted by this if the phrase being searched for is modified in the process of spinning the article. Though Blogwerx is working on a product that can detect synonymized scraping, it is clear that any system to search for this kind of abuse is going to require a great deal of power and, most likely, some expense to the user.

The focus then becomes not on abandoning the old ways of detecting plagiarism, but on adding new ways to guard against this threat. Those methods should include the following:

  1. Digital Fingerprinting: I’ve been beating the digital fingerprinting war drum for some time, but such fingerprints are the most natural defense against modified scraping. Fingerprints, if done right, have no synonyms and no translations. They will remain intact no matter how the article is spun and can easily be searched for (see the sketch after this list).
  2. Uncommon Uses: Since FeedBurner doesn’t rely on text detection to determine who is using your feed and where, its tools will remain effective even if the post is modified, so long as FeedBurner’s code is not removed.
  3. Using Names: If you can’t use the digital fingerprint plugin, you can create your own fingerprint by entering one in the footer of all your posts, editing your RSS template or simply using your name, if it is distinctive enough, at the top of your works. Like fingerprints, names do not have easy translations or synonyms and are unlikely to be altered. Even better, vanity searches can let you know who else is talking about your work.
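As a rough illustration of the fingerprinting idea, the sketch below generates a short, meaningless token, appends it to the copy of a post that goes out in the feed, and prints the exact-phrase query you would later search for. The tag format and helper functions are my own assumptions for this sketch, not a description of how any particular fingerprinting plugin works.

```python
import secrets

def make_fingerprint(length=12):
    """Generate a short random token with no dictionary meaning,
    so a spinner has no synonym or translation to swap it with."""
    return secrets.token_hex(length // 2)   # e.g. 'a3f19c0b7d2e'

def tag_post(post_html, fingerprint):
    """Append the fingerprint to the version of the post that goes into the RSS feed."""
    return post_html + f'\n<p class="fingerprint">{fingerprint}</p>'

fp = make_fingerprint()
feed_item = tag_post("<p>My original post text...</p>", fp)

# Later, search for the token as an exact phrase (for example, a quoted Google query)
# to find sites that scraped the feed, even if every other word was spun.
print(f'Search for: "{fp}"')
```

Because the token is random text rather than real words, no thesaurus pass or round-trip translation will change it, so a quoted search for it should still surface even heavily spun copies.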

In short, it is important to find elements that do not have easy translations or synonyms and focus on searching for those. These methods can, and should, be used in addition to other searching techniques to ensure that more human plagiarists or other kinds of scrapers, such as search engine scrapers, are also detected.

Even though these methods do not provide a perfect defense against modified scraping, they are a step in the right direction.

Conclusions

The good news with this kind of spam blog is that the risk of being penalized in the search engines for being a victim of scraping goes down drastically. Since the spammers avoid any potential duplicate content penalty, you do too.

However, none of this says that the scrapers won’t target keywords similar to your own and then use your own content to beat you in the results. That type of abuse might, in fact, be more likely than ever, considering that Google will also fail to recognize the spam blog as junk and discard it.

As a result, even if we discount the emotional reasons for fighting plagiarism, there is still a great deal of need to monitor our content and ensure that those who make use of it do so in an acceptable way.

Unfortunately, that will likely remain a game of cat and mouse for many years to come as the plagiarists and scrapers are rapidly changing their techniques to adapt to new situations. Clearly, we have to do the same.

Fortunately, at this phase, the adaptations are not that difficult, but the future remains much less certain.
