Transcraping: Multi-Lingual Content Theft

When Sallie Goetsch received a Google Alert for her name, she was initially excited. It meant, most likely, that one of her free Web articles had been picked up and used by another site.

However, when she followed the link, she found something else. Rather than a properly attributed article, complete with bio and proper links, she found a scraped copy of it on a spam blog. The attribution tag was missing, and her name appeared only because it was part of the article text itself.

But what made the case exceptional was not that her free article had been scraped within an hour of going live, but that the copy appeared in German while her original work was written in English.

A scraper had not only taken the content but also passed it through an automated translation service, producing a poor translation of the piece that was, in the eyes of the search engines, a completely different work.

Though Goetsch was not the first person to encounter this problem, her case does illustrate the changing face of scraping on the Web. As the Internet becomes more international and the technology used to steal content grows more advanced, this type of content theft can only become more common.

Reasons to Worry

The problem with this type of scraping is that it is almost impossible to detect. First, since the text is stripped out for translation, any markers or images placed in the feed to track usage will likely be removed. Second, because the text looks completely different from the original, searching for phrases from the original will turn up nothing. Finally, since these crude translations are done instantly, they can beat a more accurate human translation to the Web by a matter of hours or days.

What it all amounts to is a detection nightmare. To a search engine, the translated work bears no resemblance to the original, so no search engine can easily detect this kind of abuse. That works out well for the scraper, who gets what the search engines will identify as truly unique content, but badly for content creators, who can’t easily locate such infringements.
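
To make the detection problem concrete, here is a minimal sketch of one common way to compare two texts: counting the overlapping three-word sequences ("shingles") they share. This is an illustration only, not a description of how any particular search engine works. The function names and the sample sentences are assumptions made up for this example, and the German sentence is a hand-written stand-in rather than text from the actual scraped article.

```python
import re


def shingles(text, size=3):
    """Return the set of word n-grams ("shingles") found in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical sample texts, invented for illustration.
original = (
    "A scraper had taken the content and passed it "
    "through an automated translation service."
)
verbatim_copy = "Posted today: " + original
translated_copy = (
    "Ein Scraper hatte den Inhalt genommen und durch einen "
    "automatischen Übersetzungsdienst geschickt."
)

copy_score = jaccard(shingles(original), shingles(verbatim_copy))
translated_score = jaccard(shingles(original), shingles(translated_copy))

print(f"verbatim copy:   {copy_score:.2f}")
print(f"translated copy: {translated_score:.2f}")
```

In this sketch the verbatim copy shares most of its shingles with the original, while the translated copy shares none, even though it carries the same content. Any check built on matching wording will miss the translated version in the same way.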

This can hit larger sites especially hard. They often offer human-translated copies of the work, usually on localized sites, and these automated scraped copies can actually beat the legitimate ones to the market, thus improving their chance of fooling humans and