Synonymized Plagiarism: A New Threat

Plagiarists are more determined than ever to steal your content and get away with it. As the search engine wars have made content “king”, plagiarism has moved from an act of personal gratification to a full-fledged business model. Much like how the virus/worm war escalated when spammers discovered their usefulness in sending out spam, the plagiarism war is entering new and frightening territory as thieves discover its usefulness in gaining search engine ranking.

One of the critical tools in this new war is synonymizing software, which is software that takes a work and modifies it using synonyms of key words, producing a work that says practically the same thing but in a way that can’t be easily detected by search engines. This aids the plagiarist by greatly reducing the odds of their copyright infringement being discovered and prevents them from absorbing the “duplicate content” penalty some believe search engines apply.

For authors though, this can be a terrifying prospect as their hard work becomes the “seed” for tomes of search engine-friendly content that, though not making much sense, can work to bump the original from the searches and divert readers to derivatives that took only seconds to produce.

Why Worry

The problem with synonymized plagiarism is that it’s virtually undetectable by traditional means. If I were to write a play with the line “To exist, or not to exist, that is the query”, it would be recognized immediately as a rip off of Shakespeare. However, to a computer, especially a search engine, it’s a completely original sentence.

The reason is that search engines are literal. They can’t identify synonyms and even subtle changes can confuse them. In the end, there’s very little “intelligence” to search engines, just a large database of sites that are compared against text strings. The hardest part is deciding what order to display them in.

Synonymizing software takes advantage of this by exchanging words with their synonyms. Not only does this mean that they can produce thousands of variations of a piece, but it means that one who is searching for the original work will be very unlikely to find the plagiarized versions.

Worse still, even though some synonymizing products, such as ArticleBot, are poorly geared toward plagiarizing, other packages aren’t and some of their developers have even gone as far as to offer advice on how to use their software to plagiarize content and admitted to doing so themselves. One, in particular, even bragged about avoiding copyright trouble with his software and flagrantly reposted Ebooks (including several commercial ones) as search engine fodder using his tools.

Though many of these programmers are misguided about the nature of copyright law, especially with regard to the notions of derivative works and fixation, and feel that their actions are protected by it, their sense of security in not being caught is not entirely unwarranted. After all, even Copyscape, which provides some of the most advanced algorithms for searching for plagiarism on the Web, admits that its software can be defeated by changing enough of the words, even if only for close synonyms.

Without something changing fast, this could wind up being a problem that will swallow up copyright holders and search engines whole.

Why Not to Worry

The good news about synonymized plagiarism is that, most of the time, humans can detect it instantly. Going back to the “To exist, or not to exist” example, human readers recognize the source right away even though search engines might not.

Also, synonymized versions of a piece are almost always vastly inferior to the original in quality. As a representative from Copyscape put it,

“Good copy is ruined by applying these sorts of tools, since the author’s particular choice of words is crucial for conveying the meaning, connotation and style, and for maintaining professionalism, readability and flow. Readers recognize that copy which has been automatically modified in this way is generally of lower quality.”

As a writer, I have to agree. Most works produced by synonymizing software read as if they were written by someone who just started learning English as a second language. Technically, they are correct but they miss the nuances in the language. Though “ship” and “boat” might be synonyms of one another, they carry very different images.

The only way to correct this problem would be to spend a lot of time selecting synonyms by hand, and that would completely defeat the purpose of using the synonymizing software. After all, those who are using the software are interested in speed and automation, not in doing it themselves. A word processor with a thesaurus would be adequate for that.

Finally, copyright law is very clear on this issue and, though many will claim that this is an area where the law has not caught up with the technology, the opposite is actually true. The notion of derivative works has been in copyright law for centuries, and the right to create them has always belonged solely to the copyright holder.

In short, if such misuse can be discovered, there is no doubt that it is a violation of copyright law and can be dealt with accordingly. The only issues that are raised with this kind of software revolve around discovering and proving infringement, not if it can be legally defined as such.

Still, that doesn’t take a lot off the minds of writers who are dreading competing for search engine rankings with thousands of illegal copies of their own work. Quite literally, livelihoods are at stake.

What Can Be Done

Search engines are going to have to get smarter, not only to help identify plagiarism, but to protect themselves against a wash of almost-identical content clogging up their databases.

Theoretically, search engines can protect themselves by connecting their search queries to a good thesaurus and searching for all likely derivatives of a piece. It could also be useful for searches, enabling users to search for variations of a theme instantly rather than having to perform separate searches.

The problem with this is that it creates a huge burden on the search engines themselves. The processing power and database size requirements for maintaining this kind of service, especially if done with longer searches, would be huge. It’s unlikely that even Google could muster up those kinds of resources, especially for free.
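To get a rough feel for that burden, here is a minimal Python sketch of thesaurus-based query expansion. The tiny thesaurus and the function names are made up for illustration and are not how any real search engine works; the point is only that the number of variant queries is the product of the synonym counts, so it grows very quickly with longer searches.

```python
from itertools import product
from math import prod

# Hypothetical toy thesaurus, for illustration only. Each word maps to
# itself plus a couple of close synonyms.
TOY_THESAURUS = {
    "be": ["be", "exist", "live"],
    "question": ["question", "query", "problem"],
}

def expand_query(query):
    """Return every variant of the query produced by swapping in synonyms."""
    options = [TOY_THESAURUS.get(word, [word]) for word in query.lower().split()]
    return [" ".join(choice) for choice in product(*options)]

def variant_count(query):
    """Count the variants without building them: the product of the option counts."""
    return prod(len(TOY_THESAURUS.get(word, [word])) for word in query.lower().split())

query = "to be or not to be that is the question"
variants = expand_query(query)
print(len(variants), "variant searches for a ten-word query")
print("to exist or not to exist that is the query" in variants)
```

Only three of the ten words have synonyms in this toy thesaurus, and each has just three options, yet one search already turns into 27. With a full thesaurus and a paragraph-length search, the count becomes unmanageable.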

Instead, a more complicated, but less intense solution is needed.

One method could use the word type to the search engine’s advantage. Since nearly every synonym of a word is of the same type (noun, verb, etc.), it would be easy to simply number the word types themselves (noun=1, verb=2, etc.) and then convert the entire piece to a series of numbers by stripping out all punctuation, paragraph breaks and formatting. This would convert an entire article into a string of numbers like “1265313668235…” and that template could be matched against other strings on the Web. If a high percentage of the strings were similar, it could be flagged for review.
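As a rough sketch of the word-type idea, the Python below turns a sentence into a string of digits and compares two such strings. Everything beyond noun=1 and verb=2 is an assumption made for illustration: the other codes, the hand-made word list (a real system would need a proper part-of-speech tagger), the function names and the use of difflib for the comparison.

```python
import re
from difflib import SequenceMatcher

# Word-type codes. noun=1 and verb=2 follow the text; the rest are assumed.
POS_CODES = {"noun": 1, "verb": 2, "adjective": 3, "adverb": 4, "other": 5}

# Hypothetical toy lexicon, for illustration only.
TOY_LEXICON = {
    "captain": "noun", "skipper": "noun", "ship": "noun", "boat": "noun",
    "harbor": "noun", "port": "noun",
    "steered": "verb", "guided": "verb",
    "old": "adjective", "aged": "adjective", "quiet": "adjective", "calm": "adjective",
    "slowly": "adverb", "gradually": "adverb",
}

def pos_signature(text):
    """Strip punctuation and formatting, then replace each word with its type code."""
    words = re.findall(r"[a-z']+", text.lower())
    return "".join(str(POS_CODES[TOY_LEXICON.get(word, "other")]) for word in words)

def signature_similarity(text_a, text_b):
    """Similarity of the two digit strings, from 0.0 (nothing shared) to 1.0 (identical)."""
    return SequenceMatcher(None, pos_signature(text_a), pos_signature(text_b)).ratio()

original = "The old captain slowly steered the ship into the quiet harbor."
spun = "The aged skipper gradually guided the boat into the calm port."

print(pos_signature(original))
print(pos_signature(spun))
print(signature_similarity(original, spun))
```

The two sentences share only four words, all of them basic ones, but they produce the same digit string, so a pair like this would be flagged for a human to review.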

Another method could take a look at the words that don’t have easy synonyms, such as articles, conjunctions and other basic words, and create a template based upon that. If two articles share the same template (or large chunks of them did), they could be forwarded for review and a human could judge for certain if one was an unlawful derivative of another.
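A minimal Python sketch of this second idea follows. The list of basic words, the chunk size of five and the function names are all assumptions chosen for illustration; the sketch keeps the words that lack easy synonyms, blanks out everything else, and then measures how many five-word chunks two templates share.

```python
import re

# Assumed list of basic words without easy synonyms, for illustration only.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "not", "to", "of", "in",
    "is", "that", "it", "for", "with", "on", "as",
}

def function_word_template(text):
    """Keep the basic words in order and replace every other word with a placeholder."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word if word in FUNCTION_WORDS else "_" for word in words]

def shared_chunk_ratio(text_a, text_b, size=5):
    """Fraction of distinct template chunks (runs of `size` slots) the two texts share."""
    tmpl_a = function_word_template(text_a)
    tmpl_b = function_word_template(text_b)
    chunks_a = {tuple(tmpl_a[i:i + size]) for i in range(len(tmpl_a) - size + 1)}
    chunks_b = {tuple(tmpl_b[i:i + size]) for i in range(len(tmpl_b) - size + 1)}
    if not chunks_a or not chunks_b:
        return 0.0
    return len(chunks_a & chunks_b) / len(chunks_a | chunks_b)

original = "To be, or not to be, that is the question"
spun = "To exist, or not to exist, that is the query"

print(function_word_template(spun))
print(shared_chunk_ratio(original, spun))
```

Comparing chunks rather than whole templates is what lets partial matches count, so an article that lifts only a few paragraphs would still share a noticeable fraction of chunks with its source.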

While these methods could be defeated by simply rearranging the articles, it would take much more sophisticated software to do that (while still creating a coherent work) and these techniques would be very likely to catch most simple synonymized versions of a work.

Still, these would just be temporary measures to address the issue. As the software used to twist text continues to improve and grow, such measures will become almost completely obsolete.

Obviously, in the distant future, new methods of both content protection and theft detection will have to be invented.

Final Thoughts

Personally, I don’t see why anyone would use synonymizing software in the first place. If all you’re interested in is search engine ranking and you don’t care how you get there or what your copy looks like, text generation software can produce the same results, faster and without the potential for copyright problems. There just doesn’t seem to be much reason to steal when one can simply create.

(Note: I am not condoning or disapproving of that activity. I am writing a blog about plagiarism, not SEO. I’ll leave it to those who know more about SEO to make judgments about that. Personally though, I’d prefer all content to be generated the old-fashioned way, but that’s just me, as a writer, talking.)

Still, it’s very clear, judging from the activity on synonymizing forums, that a handful of very determined individuals are intent on stealing massive amounts of content for their own personal gain. They are clever individuals with no respect for copyright law or the hard work that goes into producing content for the Web.

It’s a sad thought, to say the least. We’d all love the Web to be a utopia of cooperation and support, but that’s clearly never going to be the case. Instead, if we’re going to make the Web a place that’s safe for people to express themselves without fear of plagiarism, we’re going to have to first deal with some very ugly demons.

I’m hoping that the minds on our side of the battle are up to the task.

[tags]Plagiarism, SEO, ArticleBot, Copyright Infringement, Content Theft, Copyright, Copyright Law[/tags]

12 Responses to Synonymized Plagiarism: A New Threat

  1. JoeChongq says:

    I like your idea of representing an article as a series of numbers. That is something that could be done pretty easily. But it could easily be defeated though with a bit more work by the synonymizing programs. Certain phrases could easily be automatically flipped within a sentence and still keep most of the meaning.

    “To exist, or not to exist, that is the query,” could be written as “The query is that to exist, or not to exist” without removing any word. That is worse English than the original synonym translation, but it is still readable. It would even be better if “that” was dropped.

    Another example I just discovered while trying to figure out how to spell plagiarist. I got close and spell checked it in WordPerfect, then I looked at its thesaurus entries for the word. It included as a synonym, literary pirate. Multi word replacements are going to be even harder to detect in any automated method.

    But it is a start and would catch a lot of the spam since it would deal with identical plagiarism as well as synonymized plagiarism. Search engines need to do something, not only to fight plagiarism, but to maintain at least mostly spam free search results. The problem though is how to determine what is plagiarism spam and what is legitimate syndication.

    Even the best random text generators can’t create a meaningful paragraph except possibly by accident. Even though these synonymized articles clearly don’t appear well written, they are good enough to trick humans who don’t know about all these spammer/plagiarist tricks. And many good blogs are written by people who are not native English speakers so strange sentence structure and odd word connotation is not that unusual.

    Another interesting method would be to find some article in another language and translate it to English. Some free translation services are pretty good at converting certain languages to English. The opposite could be done too. No reason to limit plagiarism to English content. This could be nearly impossible to detect since sentence structures would be changed and certainly almost all the words. And even if the original author found the derivative work, he may not recognize it.

    I think the ping idea you had has some merit, but remember the vast majority of blog plagiarism victims will never find out and I suspect most who do will not have the resources or inclination to go any further than contact the offender’s host if even that much.

    It is interesting that you mentioned in this post you are a writer. Further up in this post I was just thinking, I bet this guy is a writer. Your posts are always interesting, well written, and clearly thought out.

  3. [...] Since Sentinel, when parsing RSS feeds, ignores all punctuation and most extremely short words, it can easily see through most simple text manipulations such as restructuring sentences and introducing false paragraph breaks. However, Blogwerx took things a step further and built in a thesaurus to Sentinel’s algorithm, making it capable of detecting copies that have been rewritten in minor ways and, potentially, even articles that have been "spun" by synonymizing software. [...]

  5. [...] technology behind modified scraping has been around for several years. I first wrote about it in December of 2005. Back then the problem was fairly rare and the concept was still somewhat new. [...]

  6. [...] type of scraping is not as uncommon as we might wish and the technology to do it has been around for several years. Worse still, this type of scraping is growing much more popular as search engines clamp down on [...]

  7. [...] content through your blog’s feed and inserting or replacing synonyms in the content, typically keywords the splogger needs to get the page ranking and search terms to attract [...]

  8. Hey Jonathan,

    Great piece, and the reason iThenticate pattern recognition is so valuable. Our software is effective in identifying word substitution and/or sentence addition. See http://plagiarism.org/plag_solutions.html for details. Also, do you consider synonymizers and article spinners to be one and the same?

  9. [...] If you can change out enough words, you can easily fool plagiarism checkers. This is precisely how synonymized or “spinning” plagiarism [...]

  10. [...] find it cheaper and easier to shell out a small amount of money on PLR content that they can then run through content spinners and then generate thousands of low-quality articles from the set that are at least somewhat unique [...]

  11. Solid Snake says:

    Property is a solution to things being scarce. Because there are only so many apples in the world, we have to come to an agreement on how those apples are to be distributed, if we are rational and moral agents (and only moral agents can have rights). There is only one possible solution as all other solutions result in logical contradictions. That solution is called property. If apples were not scarce, they could not be owned, much like the air you breathe is not property because it is not scarce. You having an idea does not preclude me from having the same idea, therefore ideas are not property since they can be duplicated freely. Understand I am not against trade secrets or you keeping your idea to yourself, but ideas are not and cannot be property because they are not scarce.

    Plagiarism is simply copying while not giving credit to the person that created it. It’s a misrepresentation, it’s a lie, which of course we must be against. It is not however theft, since the owner still has his copy. Copying is not and cannot be theft.
