Modified Scraping on the Rise

It appears that Google’s push to handle duplicate content may be having an unintended side effect.

Even though a recent report by Attributor indicates that the search engine has done a terrible job separating originals from copies, the spammers don’t seem to be taking any chances.

Spam bloggers are no longer content with scraping entries and republishing them; they are now modifying them in a variety of ways. On The Blog Herald, another site I write for, editor Tony Hung wrote about a site that seemed to be either synonymizing or double translating content from the site. On tenforty, blogger Deb wrote about a case where her story was translated into another language before being reposted on a spam Blogspot blog.

What started out as a rare phenomenon is now turning into a regular occurrence. Unfortunately, as the tactics of spammers change, so must the tactics of those who seek to protect their content, and this calls for a new look at how we protect our works on the Web.

Background

The technology behind modified scraping has been around for several years. I first wrote about it in December of 2005. Back then the problem was fairly rare and the concept was still somewhat new. However, it seems as if more and more spammers are catching on to the trick and, possibly, that new spam blog networks are cropping up to take advantage of the technique.

The idea is that posting verbatim copied works is dangerous. Not only are you likely to get caught and shut down, but Google and other search engines assign penalties to duplicate content, making it harder to get those search results needed to make spamming profitable.

Since editing content by hand takes too long and defeats the purpose of scraping, spammers have started creating ways of modifying or “spinning” content before reposting it in order to fool the search engines into thinking that the spam site is both legitimate and original.

To achieve that, they use one of several techniques including, but not limited to, the following:

  1. Synonymized Content: The most basic approach takes an article and swaps out occasional words for synonyms according to a built-in thesaurus. Such a system can actually create hundreds or thousands of articles from a single source by using different combinations of synonyms.
  2. Translated Content: This approach runs the content through an automatic translation program similar to what you find on Babelfish. Though the translations are far from perfect and leave the work in a foreign language, the result is usually intelligible to humans and search engines alike. This will likely become a more popular technique as the Web gains more of an audience in non-English-speaking countries and those markets become more valuable for spammers.
  3. Double Translated Content: The same as translated content, but this kind translates the translation back into English. In many cases this produces a heavily modified and often unintelligible outcome that bears little resemblance to the original. This type of scraper is purely theoretical at this time but very likely does exist.

In all of the above cases, the outcome is the same: the scraped article bears little resemblance to the original, making it much more difficult to detect and to stop, both for the victims and for the search engines.
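
To make the "hundreds or thousands of articles" claim concrete, here is a minimal Python sketch of synonymized content. The thesaurus, the sample sentence and the function name are all made up for illustration; real spinning tools are certainly more elaborate, and this is not a description of any particular one. It simply shows how the number of variants multiplies with each word that has synonyms.

```python
from itertools import product

# Toy thesaurus: hypothetical entries, for illustration only.
# Each word maps to the list of words it may be replaced with (itself included).
THESAURUS = {
    "quick": ["quick", "fast", "rapid"],
    "big": ["big", "large"],
    "house": ["house", "home", "dwelling"],
}

def spin_variants(sentence):
    """Return every variant produced by swapping words for listed synonyms."""
    # Words missing from the thesaurus have exactly one option: themselves.
    options = [THESAURUS.get(word, [word]) for word in sentence.split()]
    return [" ".join(choice) for choice in product(*options)]

variants = spin_variants("the quick sale of a big house")
print(len(variants))  # 3 * 2 * 3 = 18 variants from a seven-word sentence
```

Only three of the seven words have synonyms here, yet one sentence already yields 18 versions. A full article with dozens of swappable words grows far beyond that.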

Changing Strategies

The problem for bloggers when dealing with this type of plagiarism is that the typical methods of detection simply don’t work. Copyscape, though drastically improved, will struggle with this kind of scraping, as will all other plagiarism checkers that work by looking for verbatim copying. This includes high-end academic solutions.

Even Google Alerts can be thwarted by this if the phrase being searched for is modified in the process of spinning the article. Though Blogwerx is working on a product that can detect synonymized scraping, it is clear that any system to search for this kind of abuse is going to require a great deal of power and, most likely, some expense to the user.
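
As a rough illustration of why verbatim matching breaks down, the Python sketch below compares texts by their shared three-word sequences. This is only an assumed stand-in for how a verbatim checker might work, not the actual method used by Copyscape, Google Alerts or any other product, and the sample sentences are invented.

```python
def shingles(text, n=3):
    """Return the set of n-word sequences found in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(a, b, n=3):
    """Share of n-word sequences the two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Invented sample texts: the spun copy swaps just two words for synonyms.
original = "spammers are modifying scraped posts before they republish them online"
spun = "spammers are altering scraped posts before they repost them online"

print(overlap(original, original))        # 1.0 for a verbatim copy
print(round(overlap(original, spun), 2))  # 0.14: 2 shared of 14 distinct sequences
print("modifying scraped posts" in spun)  # False: an exact-phrase search misses it
```

Swapping two words out of ten is enough to wipe out most of the matching sequences, and any alert built on a phrase that contains one of those words never fires.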

The focus then becomes not abandoning the old ways of detecting plagiarism, but on adding new ways to guard against this threat. Those methods should include the following:

  1. Digital Fingerprinting: I’ve been beating the digital fingerprinting war drum for some time, but such fingerprints are the most natural defense against modified scraping. Fingerprints, if done right, have no synonyms and no translations. They will remain intact no matter how the article is spun and can easily be searched.
  2. Uncommon Uses: Since FeedBurner doesn’t rely on text detection to determine who is using your feed and where, their tools will remain effective even if the post is modified, that is, so long as FeedBurner’s code is not removed.
  3. Using Names: If you can’t use the digital fingerprint plugin, you can create your own by either entering your own fingerprint in the footer of all your posts, editing your RSS template or simply using your name, if it is unusual enough, at the top of your works. Like fingerprints, names do not have easy translations or synonyms and are unlikely to be altered. Even better, vanity searches can let you know who else is talking about your work.

In short, it is important to find elements that do not have easy translations or synonyms and focus on searching for those. These methods can, and should, be used in addition to other searching techniques to ensure that more human plagiarists or other kinds of scrapers, such as search engine scrapers, are also detected.
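
For those who want to roll their own fingerprint, the Python sketch below shows the general idea. It is not how the digital fingerprint plugin works; the function names, the "fp" prefix and the sample post are assumptions made up for this example. It builds a random token, appends it to a post and checks whether a copy still carries it after a word has been swapped.

```python
import secrets

def make_fingerprint(prefix="fp"):
    """Build a random token that no thesaurus or dictionary will contain.

    The "fp" prefix and the 16-hex-character length are arbitrary choices.
    """
    return f"{prefix}{secrets.token_hex(8)}"

def add_fingerprint(post, fingerprint):
    """Append the fingerprint to the footer of a post."""
    return f"{post}\n\n{fingerprint}"

def contains_fingerprint(text, fingerprint):
    """Check whether a suspected copy still carries the fingerprint."""
    return fingerprint.lower() in text.lower()

fingerprint = make_fingerprint()
signed = add_fingerprint("A quick guide to protecting a big blog.", fingerprint)

# Simulate a crude synonym spin: ordinary words change, the token does not.
spun = signed.replace("quick", "fast").replace("big", "large")

print(contains_fingerprint(spun, fingerprint))  # True
```

In practice you would generate the token once, keep it in every post and feed item, and then search for that same string in a search engine. The check above only covers text you have already found.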

Even though these methods do not provide a perfect defense against modified scraping, they are a step in the right direction.

Conclusions

The good news with this kind of spam blog is that the risk of being penalized in the search engines for being a victim of scraping goes down drastically. Since spammers avoid any potential duplicate content penalty, you do too.

However, none of this says that the scrapers won’t target keywords similar to your own and then use your own content to beat you in the results. That type of abuse might, in fact, be more likely than ever considering that Google will also not recognize the spam blog as junk and discard it.

As a result, even if we discount the emotional reasons for fighting plagiarism, there is still a great deal of need to monitor our content and ensure that those who make use of it do so in an acceptable way.

Unfortunately, that will likely remain a game of cat and mouse for many years to come as the plagiarists and scrapers are rapidly changing their techniques to adapt to new situations. Clearly, we have to do the same.

Fortunately, at this phase, the adaptations are not that difficult but the future remains much less certain.

29 comments
JB

Ben: Thanks for commenting, I appreciate the input!
Robin I have to agree, you win the writing competition hands down but the humor one, eh, that has to go to the other guys. They are MUCH funnier.
On that note, I'm proud to announce the new blog based on PT that will have comedy spinnings of all my posts. Entitled "Lie Copying in the Moment" it will be a laugh riot.

Ben Maurer

Hi,
I'm one of the engineers on reCAPTCHA. We wanted to make it crystal clear to the people who put reCAPTCHA on their site that we use the input of their users to digitize books. We don't actually claim any copyright over the digitization we create -- we make it available under the same terms as the non-digitized content.
Any site you use will have similar terms of service -- if you make a comment on this blog, you agree to post under a "Creative Commons Attribution-NoDerivs 2.5 License". When you browse around a site like Amazon, the navigational data is used for their research into consumer preferences. In our case, we use this work to support the need to preserve history into the digital age.

JB

Albert: The problem with Google's watching for duplicate content is that, as Attributor showed, they aren't doing a very good job with it. Originals consistently rank lower than copies.
It's a good idea to use robots.txt, I admit that I need to be more careful myself but I also use meta tags to prevent problems.
As far as reCAPTCHA goes, I have read the TOS, as I do with everything I use here, and didn't find anything out of the ordinary with it. However, as a user, you don't have anything to fear and here is why.
First, there is no TOS when just using reCAPTCHA as a visitor. You sign no agreement with them. They can hold you to nothing. I am the one who signed the agreement and, unless they add something to the CAPTCHA itself, they can not legally hold you to anything.
Second, transferring copyright, by law, requires a written agreement, I doubt anyone has signed such agreement with reCAPTCHA. I haven't.
Third, any input you put into reCAPTCHA will, almost certainly, be uncopyrightable. A pair of words is not a copyrightable product. It can be trademarked, if used for a business, but not copyrighted.
Finally, given the nature of the work, if any input could be deemed copyrightable, it would seem that the damage would be minimal.
As someone with a hearing impaired brother who has worked closely with both the blind and deaf, I tend to support what reCAPTCHA is doing and the fact that they have very solid spam protection is a great bonus.
If what they are doing bothers you and you feel that they are taking something from you, consider it a donation to a worthy cause, making books readable to the blind.
It seems like a worthy cause to me at least.
As I said though, you as a user don't have anything to fear from reCAPTCHA, at least not that I see. However, I am going to forward your comment to a contact I have there and see if they have any input on this.

Albert

I'm glad Google is watching out for duplicate content. I haven't figured out a way yet to manage access to my RSS feeds, so for the time being it's easy for other sites to syndicate my content without my approval, even though on my sites I say that practice is not allowed.

To accommodate Google's duplicate content checks, I try and make sure that my robots.txt file tells Google only to check certain pages, and to ignore the rest.

I see you are using the reCaptcha spam defense. Have you read their TOS? A little over the top if you ask me. Note to reCAPTCHA - I do NOT assign the intellectual property created by my answering the reCAPTCHA question.

Trackbacks

  1. [...] a recent article on my site, I talked about various techniques for detecting spun versions of your posts. Those tips included [...]

  2. [...] The term spinning was “invented” by Jonathan Bailey of Plagiarism Today [...]

  3. [...] the article offers more detail on how to protect your site at his site Plagiarism Today. Check out Modified Scraping on the Rise. Lorelle at WordPress also offers comments on What Do You Do When Someone Steals Your [...]

  4. [...] that doesn’t know how to change the template? Is the strange word choice the result of automated spinning or someone learning English? If the spam blog did its job, it can be difficult to [...]

  5. [...] news in all of this is that Webmasters do not have a great deal to fear from spinning spam. Though it has been on the rise for years now, it has proved to be a fairly ineffective form of [...]

  6. [...] of his previous discussions on the subject, Five Years Later: Why RSS Scraping Still is Not OK and Modified Scraping on the Rise, providing a more in-depth look into the [...]