Synonymized Plagiarism: A New Threat

Plagiarists are more determined than ever to steal your content and get away with it. As the search engine wars have made content “king”, plagiarism has moved from an act of personal gratification to a full-fledged business model. Much as the virus/worm war escalated when spammers discovered how useful viruses and worms were for sending out spam, the plagiarism war is entering new and frightening territory as thieves discover its usefulness in gaining search engine ranking.

One of the critical tools in this new war is synonymizing software: software that takes a work and replaces key words with their synonyms, producing a work that says practically the same thing but in a way that can’t be easily detected by search engines. This aids the plagiarist by greatly reducing the odds of their copyright infringement being discovered, and it helps them avoid the “duplicate content” penalty some believe search engines apply.

For authors though, this can be a terrifying prospect as their hard work becomes the “seed” for tomes of search engine-friendly content that, though not making much sense, can work to bump the original from the searches and divert readers to derivatives that took only seconds to produce.

Why Worry

The problem with synonymized plagiarism is that it’s virtually undetectable by traditional means. If I were to write a play with the line “To exist, or not to exist, that is the query”, it would be recognized immediately as a rip-off of Shakespeare. However, to a computer, especially a search engine, it’s a completely original sentence.

The reason is that search engines are literal. They can’t identify synonyms and even subtle changes can confuse them. In the end, there’s very little “intelligence” to search engines, just a large database of sites that are compared against text strings. The hardest part is deciding what order to display them in.

Synonymizing software takes advantage of this by exchanging words for their synonyms. Not only can it produce thousands of variations of a piece, but anyone searching for the original work is very unlikely to find the plagiarized versions.
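
To make the point concrete, here is a minimal Python sketch of the idea. The two-entry synonym table and the function names are invented for illustration and are not taken from any real synonymizing product, and the literal phrase check is only a stand-in for the kind of exact matching described above, not a model of how any particular search engine works.

```python
import re

# Hypothetical, hand-picked synonym table (an assumption for this sketch).
SYNONYMS = {"be": "exist", "question": "query"}

def synonymize(text):
    """Swap each word that has an entry in SYNONYMS; leave punctuation alone."""
    def swap(match):
        word = match.group(0)
        return SYNONYMS.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", swap, text)

def literal_match(phrase, document):
    """Exact phrase search, ignoring case only."""
    return phrase.lower() in document.lower()

original = "To be, or not to be, that is the question"
copy = synonymize(original)

print(copy)
print(literal_match("to be, or not to be", original))  # found in the original
print(literal_match("to be, or not to be", copy))      # missed in the copy
```

A human reader sees the two lines as the same sentence, but the literal search finds the phrase only in the original.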

Worse still, although some synonymizing products, such as ArticleBot, are poorly geared toward plagiarizing, other packages are not. Some of their developers have even gone as far as to offer advice on how to use their software to plagiarize content, and have admitted to doing so themselves. One in particular bragged about avoiding copyright trouble with his software and flagrantly reposted e-books (including several commercial ones) as search engine fodder using his tools.

Though many of these programmers are misguided about the nature of copyright law, especially with regard to the notions of derivative works and fixation, and feel that their actions are protected by it, their sense of security in not being caught is not entirely unwarranted. After all, even Copyscape, which provides some of the most advanced algorithms for searching for plagiarism on the Web, admits that its software can be defeated by changing enough of the words, even if only for close synonyms.

Without something changing fast, this could wind up being a problem that will swallow up copyright holders and search engines whole.

Why Not to Worry

The good news about synonymized plagiarism is that, most of the time, humans can detect it at a glance. Going back to the “To exist, or not to exist” example, human readers pick up on it almost instantly even though search engines might not.

Also, synonymized versions of a piece are almost always vastly inferior to the original in quality. As a representative from Copyscape put it,

“Good copy is ruined by applying these sorts of tools, since the author’s particular choice of words is crucial for conveying the meaning, connotation and style, and for maintaining professionalism, readability and flow. Readers recognize that copy which has been automatically modified in this way is generally of lower quality.”

As a writer, I have to agree. Most works produced by synonymizing software read as if they were written by someone who just started learning English as a second language. Technically, they are correct but they miss the nuances in the language. Though “ship” and “boat” might be synonyms of one another, they carry very different images.

The only way to correct this problem would be to spend a lot of time selecting synonyms by hand, and that would completely defeat the purpose of using the synonymizing software. After all, those who are using the software are interested in speed and automation, not in doing it themselves. A word processor with a thesaurus would be adequate for that.

Finally, copyright law is very clear on this issue and, though many will claim that this is an area where the law has not caught up with the technology, the opposite is actually true. The notion of derivative works has been in copyright law for centuries, and the right to create them has always belonged solely to the copyright holder.

In short, if such misuse can be discovered, there is no doubt that it is a violation of copyright law and can be dealt with accordingly. The only issues raised by this kind of software revolve around discovering and proving infringement, not whether it can be legally defined as such.

Still, that doesn’t take a lot off the minds of writers who are dreading competing for search engine rankings with thousands of illegal copies of their own work. Quite literally, livelihoods are at stake.

What Can Be Done

Search engines are going to have to get smarter, not only to help identify plagiarism, but to protect themselves against a wash of almost-identical content clogging up their databases.

Theoretically, search engines can protect themselves by connecting their search queries to a good thesaurus and searching for all likely derivatives of a piece. The same feature could also be useful to ordinary searchers, enabling them to search for variations of a theme instantly rather than having to perform separate searches.

The problem with this is that it creates a huge burden on the search engines themselves. The processing power and database size requirements for maintaining this kind of service, especially if done with longer searches, would be huge. It’s unlikely that even Google could muster up those kinds of resources, especially for free.
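
A rough sketch shows why the cost climbs so quickly. Each word with alternatives multiplies the number of phrasings that would need to be searched. The toy thesaurus below is an assumption made up for this example, with only two alternatives for two of the words; a real thesaurus would offer far more choices for far more words.

```python
from itertools import product
from math import prod

# Hypothetical toy thesaurus (an assumption for this sketch).
THESAURUS = {
    "exist": ["be", "live"],
    "query": ["question", "inquiry"],
}

def alternatives(word):
    """The word itself plus any synonyms the thesaurus lists for it."""
    return [word] + THESAURUS.get(word, [])

def count_variants(phrase):
    """Number of distinct phrasings: the product of the choices per word."""
    return prod(len(alternatives(w)) for w in phrase.lower().split())

def expand(phrase):
    """Every phrasing a thesaurus-backed search would have to try."""
    options = [alternatives(w) for w in phrase.lower().split()]
    return [" ".join(choice) for choice in product(*options)]

phrase = "to exist or not to exist that is the query"
variants = expand(phrase)
print(count_variants(phrase))
print("to be or not to be that is the question" in variants)
```

Even in this tiny case, three words with three choices each turn one ten-word search into 27. With, say, four choices for each of ten words, the count would pass a million (4 to the 10th power is 1,048,576).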

Instead, a more complicated but less resource-intensive solution is needed.

One method could use the word type to the search engine’s advantage. Since nearly every synonym of a word is of the same type (noun, verb, etc.), it would be easy to simply number the word types themselves (noun=1, verb=2, etc.) and then convert the entire piece to a series of numbers by stripping out all punctuation, paragraph breaks and formatting. This would convert an entire article into a string of numbers like “1265313668235…” and that template could be matched against other strings on the Web. If a high percentage of two strings were similar, the pair could be flagged for review.
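
Here is a minimal sketch of that word-type idea in Python. The noun=1 and verb=2 codes come from the description above; the remaining codes, the tiny hand-written lexicon (standing in for a real part-of-speech tagger) and the function names are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# noun=1 and verb=2 follow the text; the other codes are assumptions.
POS_CODES = {"noun": 1, "verb": 2, "adjective": 3, "adverb": 4, "other": 5}

# Hypothetical mini-lexicon; a real system would use a proper tagger.
LEXICON = {
    "be": "verb", "exist": "verb", "is": "verb",
    "question": "noun", "query": "noun",
    "not": "adverb",
}

def pos_signature(text):
    """Drop punctuation and formatting, then map each word to its type code."""
    words = re.findall(r"[a-z']+", text.lower())
    return "".join(str(POS_CODES[LEXICON.get(w, "other")]) for w in words)

def similarity(a, b):
    """Share of the two signatures that lines up, from 0.0 to 1.0."""
    return SequenceMatcher(None, pos_signature(a), pos_signature(b)).ratio()

original = "To be, or not to be, that is the question"
copy = "To exist, or not to exist, that is the query"

print(pos_signature(original))
print(pos_signature(copy))
print(similarity(original, copy))
```

Because each synonym keeps the word type of the word it replaced, both lines reduce to the same string of digits, so the pair would be flagged for a human to review.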

Another method could look at the words that don’t have easy synonyms, such as articles, conjunctions and other basic words, and create a template based upon them. If two articles share the same template (or large chunks of it), they could be forwarded for review and a human could judge for certain whether one was an unlawful derivative of the other.
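
This second approach can be sketched in much the same way. The short list of basic words below is an assumption chosen to fit the example, not a complete list, and the function names are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of words with no easy synonyms (an assumption).
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "not",
    "to", "of", "in", "that", "is",
}

def skeleton(text):
    """Keep the basic words in place; blank out everything else."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w if w in FUNCTION_WORDS else "_" for w in words]

def template_overlap(a, b):
    """Share of the two templates that lines up, from 0.0 to 1.0."""
    return SequenceMatcher(None, skeleton(a), skeleton(b)).ratio()

original = "To be, or not to be, that is the question"
copy = "To exist, or not to exist, that is the query"
unrelated = "Content is king and plagiarism is a business"

print(" ".join(skeleton(original)))
print(template_overlap(original, copy))
print(template_overlap(original, unrelated))
```

The synonymized line leaves every basic word exactly where it was, so its template is identical to the original’s, while the unrelated sentence scores lower.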

While these methods could be defeated by simply rearranging the articles, it would take much more sophisticated software to do that (while still creating a coherent work) and these techniques would be very likely to catch most simple synonymized versions of a work.

Still, these would just be temporary measures to address the issue. As the software used to twist text continues to improve and grow, such measures will become almost completely obsolete.

Obviously, in the distant future, new methods of both content protection and theft detection will have to be invented.

Final Thoughts

Personally, I don’t see why anyone would use synonymizing software in the first place. If all you’re interested in is search engine ranking and you don’t care how you get there or what your copy looks like, text generation software can produce the same results, faster and without the potential for copyright problems. There just doesn’t seem to be much reason to steal when one can simply create.

(Note: I am not condoning or disapproving of that activity. I am writing a blog about plagiarism, not SEO. I’ll leave it to those who know more about SEO to make judgments about that. Personally though, I’d prefer all content to be generated the old-fashioned way, but that’s just me, as a writer, talking.)

Still, it’s very clear, judging from the activity on synonymizing forums, that a handful of very determined individuals are out to steal massive amounts of content for their own personal gain. They are clever individuals with no respect for copyright law or the hard work that goes into producing content for the Web.

It’s a sad thought, to say the least. We’d all love the Web to be a utopia of cooperation and support, but that’s clearly never going to be the case. Instead, if we’re going to make the Web a place that’s safe for people to express themselves without fear of plagiarism, we’re going to have to first deal with some very ugly demons.

I’m hoping that the minds on our side of the battle are up to the task.
