Transcraping: Multi-Lingual Content Theft

Jonathan Bailey, August 29, 2007

When Sallie Goetsch received a Google Alert for her name, she was initially excited. It most likely meant that one of her free Web articles had been picked up and used by another site.

However, when she followed the link, she found something else. Rather than a properly attributed article, complete with bio and proper links, she found a scraped copy of it on a spam blog. The attribution tag was missing, and her name appeared only because it was part of the article text itself.

But what made the case exceptional was not that her free article had been scraped within an hour of going live, but that the copy appeared in German even though her original work was written in English.

A scraper had not only taken the content, but passed it through an automated translation service, producing a poor translation of the piece that was also a completely different work in the eyes of the search engines.

Though Goetsch was not the first person to encounter this problem, her case does illustrate the changing face of scraping on the Web. As the Internet becomes more international and the technology used to steal content grows more advanced, this type of content theft can only become more common.

Reasons to Worry

The problem with this type of scraping is that it is almost impossible to detect. First, since the text is stripped out for translation, any markers or images placed in the feed to track usage will likely be removed. Second, because the translated text looks completely different from the original, traditional searching for copied passages will be to no avail. Finally, since these crude translations are done instantly, they can beat a more accurate human translation to the Web by a matter of hours or days.

What it all amounts to is a detection nightmare. To a search engine, the translated work bears no resemblance to the original. No search engine can easily detect this kind of abuse. That works out great for the scraper, who gets what the search engines will identify as truly unique content, but bad for content creators who can’t easily locate such infringements.

This can hit larger sites especially hard. They often offer human-translated copies of their work, usually on localized sites, and these automated scraped copies can actually beat the legitimate ones to the market, thus improving their chance of fooling humans and search engines alike into believing they are original works.

Fortunately, the likelihood of such a situation is very slim. Automated scraping and translating has some severe limitations that reduce its potential impact on a blogger or other writer.

Reasons Not to Worry

Though this type of scraping is still definitely copyright infringement (translation and derivative works rights belong to the copyright holder), there are several reasons why it is less worrisome than more traditional scraping.

The good news and the bad news are one and the same: Search engines can’t detect this kind of plagiarism.

Though that fact makes transcraping harder to detect and follow up on, it also means that you won’t be competing with the scraped copies for search engine attention. There is no way you can be bitten by a duplicate content penalty, no way that you can be replaced in the search engine results and no real fear of a human mistaking the scraper site for the original.

Furthermore, even if a human is able to make the connection, meaning he or she saw both sites and spoke both languages, the automatic translation is too rough to fool anyone. Given how bad and unprofessional automated translation tends to be, the likelihood of anyone thinking that the scraped version is the original is slim to none.

Finally, this type of scraping is still fairly rare, according to the best guesses available. Though the technology exists, the primary target language for most scrapers and spammers is still English. If you write primarily in a language other than English, especially one that is easily translated to English, the concern might be greater but, even then, it isn’t a likely event.

For most scrapers, there is still plenty of content available in their native language to avoid having to translate anything. Since spammers tend to favor the path of least resistance, only a handful will ever opt to take this route.

Detecting Transcraping

But even though this type of scraping is not as dangerous as the more traditional variety, many Webmasters may still want to monitor for it, especially those who charge for commercial use or offer their own translations.

Though traditional detection methods may not provide much help in finding translated plagiarism, other methods can.

The key is to focus on words that will not be modified in a translation. The most obvious examples are a name or a site address, both of which will go unmodified. Another option is to use digital fingerprints, which should be nonsense to a translation program and will therefore also remain unchanged.
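To make the idea concrete, here is a minimal sketch of checking a suspect page for terms that a translation should leave alone. The function name, the sample German text, the author name, the address and the nonsense word are all made up for illustration; none of them comes from a real plugin or from the case described above.

```python
def find_markers(page_text, markers):
    """Return the markers that appear in page_text, ignoring case."""
    haystack = page_text.casefold()
    return [m for m in markers if m.casefold() in haystack]


# Hypothetical machine-translated copy of a scraped article.
translated = (
    "Dieser Artikel wurde von Jane Q. Writer geschrieben. "
    "Mehr unter example.com/artikel."
)

# Terms a translator should pass through untouched: a name, a site
# address and a nonsense fingerprint word (all assumptions).
markers = ["Jane Q. Writer", "example.com", "zq7xkvo3"]

found = find_markers(translated, markers)
print(found)
```

In this sample the name and the address are found while the nonsense word is not, which only means this invented text never contained it. In practice, a name or address alone will also match legitimate mentions, so a hit is a lead to review by hand rather than proof of scraping.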

Thus, the best step you can take to detect transcraping is to take advantage of existing digital fingerprint plugins such as MaxPower’s or Copyfeed and then track those fingerprints either using Google Alerts or through the plugin’s internal search tools.

This type of automated scraping is exactly what digital fingerprints are designed to catch and, as long as a nonsense word is chosen and it is embedded in the actual text of the article, it should appear just fine in any translated version of the work.
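For readers curious about what such a fingerprint might look like, the following is a rough sketch and not a description of how MaxPower’s plugin or Copyfeed actually work. It assumes a fingerprint is simply a nonsense word derived from a private site secret and a post identifier, appended to the body text; every name and value in it is hypothetical.

```python
import hashlib


def make_fingerprint(site_secret, post_id, length=12):
    """Derive a repeatable nonsense word from a secret and a post ID."""
    digest = hashlib.sha256(f"{site_secret}:{post_id}".encode("utf-8")).hexdigest()
    # The "fp" prefix guarantees the token starts with letters.
    return "fp" + digest[:length]


def embed_fingerprint(article_text, fingerprint):
    """Append the fingerprint to the body text itself, not to markup,
    so it survives when a scraper strips tags before translating."""
    return f"{article_text.rstrip()} {fingerprint}"


def contains_fingerprint(text, fingerprint):
    """Check a suspect page for the fingerprint, ignoring case."""
    return fingerprint.lower() in text.lower()


# Assumed values for illustration only.
fingerprint = make_fingerprint("my-private-secret", "post-42")
published = embed_fingerprint("This is the article body.", fingerprint)

# A translated copy would change the sentence but, ideally, not the token.
suspect_page = f"Dies ist der Artikeltext. {fingerprint}"
print(contains_fingerprint(suspect_page, fingerprint))
```

Because the word is derived rather than random, it can be regenerated later for any post and then fed to Google Alerts or a manual search. The sketch assumes the translation service leaves unknown words untouched, which is the same assumption the fingerprint approach itself relies on.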

Fortunately, the scraper bots are not yet smart enough to weed out nonsense from actual content. Once they are, new technologies will have to be developed to detect scraping and put a stop to it.

Conclusions

In the end, there is little reason to worry specifically about translated scraping. Not only is it less likely to harm you but, if you’re using digital fingerprints correctly, odds are that you are already fairly well protected against it.

Still, it is an interesting form of content theft, at least from an academic viewpoint, and it is a symbol of how things are progressing in this field. The tools are growing more sophisticated, spammers are broadening their scope and new techniques are constantly being developed.

Though this particular method is unlikely to have a major impact in the long run, it is clear that staying on top of these issues and the advancements scrapers make will not only be more difficult, but also more important in the months and years to come.

