Over the course of running this site, I’ve talked a great deal about Statistically Improbable Phrases (SIPs), or phrases that, when searched for, are unlikely to point to anything other than the page or document they were first found in.
The idea is simple: the English language is very large and, though estimates of the “average” vocabulary vary widely, it’s generally at least in the tens of thousands of words. This means that the odds of a string of exact words, especially less-common ones, appearing randomly together in the exact order they are presented in a document can be vanishingly small.
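To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes a vocabulary of 20,000 words and, unrealistically, that every word is equally likely to appear; both are illustrative assumptions, not measurements. Real writing leans heavily on a small set of common words, so the true odds of a coincidental match are higher than this suggests, particularly for short phrases.

```python
def chance_of_random_match(vocab_size: int, phrase_length: int) -> float:
    """Chance that one randomly drawn word sequence equals a given phrase.

    Assumes every word in the vocabulary is equally likely, which is a
    deliberate oversimplification of how language actually works.
    """
    return 1.0 / (vocab_size ** phrase_length)


# 20,000 is an assumed vocabulary size, chosen only for illustration.
for length in (3, 6, 12):
    print(length, chance_of_random_match(20_000, length))
```

Even under these crude assumptions, each extra word divides the chance of a coincidence by the size of the vocabulary, which is why longer phrases become unique so quickly.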
This makes it easy to search for duplicate versions of an article or a document online using a good SIP. Simply put the SIP in quotes and search for it. Any results that appear were most likely copied from the original source.
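If you check a lot of phrases, wrapping them in quotes by hand gets tedious. The following is a minimal sketch of how that step might be scripted. The function names are my own, and the base URL is simply the familiar Google search address; swap in whatever search engine you prefer.

```python
from urllib.parse import quote_plus


def exact_phrase_query(phrase: str) -> str:
    """Wrap a phrase in straight double quotes for an exact-match search."""
    # Drop any quotes already present (straight or curly) so they don't nest.
    for quote_char in '"“”':
        phrase = phrase.replace(quote_char, " ")
    return '"' + " ".join(phrase.split()) + '"'


def search_url(phrase: str, base: str = "https://www.google.com/search?q=") -> str:
    """Build a search URL for the quoted phrase (the base URL is an assumption)."""
    return base + quote_plus(exact_phrase_query(phrase))


print(search_url("Plagiarism is a tool used by"))
```

This only builds the link. You would still open it in a browser and read the results yourself.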
But choosing a good SIP is something of an art form that often requires a great deal of trial and error. A shorter SIP will catch more cases of copying, especially modified and edited copying, but also increases the risk of false positives as the odds of the phrase appearing randomly increase. Longer ones reduce false positives, but may miss cases where the phrase is cut off or edited.
So how short can a SIP be and still be useful? There is no hard and fast rule, as the answer depends on how distinctive your writing is, but there are some general guidelines that may be useful.
Understanding Statistically Improbable Phrases
The idea behind a SIP is simple: it’s a phrase that is, or at least should be, unique to the work you are looking at. This is useful whether you’re trying to discover if the work is a copy or has been copied.
But finding a unique or nearly-unique phrase is not simple. The reason is that not all words in the English language are used equally often. Words such as “the” and “of” are used with such regularity as to be useless for this purpose.
For example, the phrase “from this second”, which is composed entirely of words from the 500 most common, produces over 30 million results in Google, even when used in quotes. However, “forged perplexing decisions” produces no results (at least until Google finds this article) because the words are much less common and rarely, if ever, appear in that order.
The result is that, in the latter case, a three-word SIP is probably more than adequate but, in the former, it is completely useless. Likewise, there’s a whole range of effectiveness in between the two extremes.
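One rough way to tell which end of that range a phrase falls on is to count how many of its words are not on a list of very common words. The sketch below does that. Note that `COMMON_WORDS` is a tiny, hand-picked stand-in I made up to cover the examples above, not a real frequency list; in practice you would load a proper list of the most common English words.

```python
# Tiny, hand-picked stand-in for a real list of very common English words.
COMMON_WORDS = {
    "the", "of", "and", "a", "to", "in", "is", "by",
    "from", "this", "second", "used",
}


def uncommon_words(phrase: str, common: set = COMMON_WORDS) -> list:
    """Return the words in the phrase that are not on the common-word list."""
    words = [w.strip('.,;:!?"“”\'’()').lower() for w in phrase.split()]
    return [w for w in words if w and w not in common]


print(uncommon_words("from this second"))             # every word is common
print(uncommon_words("forged perplexing decisions"))  # every word is uncommon
```

A phrase with no uncommon words is a poor candidate no matter how you arrange it, while one with several is worth testing in a search engine.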
With that in mind, let’s take a look at yesterday’s article about the plagiarism scandals in Romania, in particular, the following phrase from it.
Plagiarism is a tool used by political opponents, like looking for inconsistent voting records or trying to discover personal indiscretions
Searching for that whole phrase, with quotes, produces only the original article. However, the phrase is too long to be of much use as it can be easily edited or truncated.
So we shorten it to read:
Plagiarism is a tool used by political opponents
The results are the same, only matching the original article.
This phrase is significantly more useful at eight words long and is much less likely to be edited or manipulated. However, it is still longer than most would consider ideal.
So we shorten it again to:
“Plagiarism is a tool”
But, this time, the phrase is no longer unique, producing at least five other results (47 others if you expand the results), and they are all unrelated. Though the phrase is still fairly rare, any match for it could have come from any of those sources or could have been written independently.
So we expand it again to:
“Plagiarism is a tool used by”
Now the phrase is unique once again, using just six words. Also, eliminating the “by” only adds two more results, which may be a manageable amount.
As a result, either of those two versions would probably be best for use in a SIP, at least if you’re choosing to start from the word “Plagiarism”.
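The trial and error above boils down to a simple loop: start from a short phrase and add a word at a time until the number of matches drops to an acceptable level. Here is a minimal sketch of that loop. Because I can't assume any particular search engine API, `count_hits_in_corpus` searches a small local list of texts instead, and the two "other" documents are invented purely so the example has something to collide with.

```python
from typing import Callable, Optional, Sequence


def count_hits_in_corpus(phrase: str, corpus: Sequence[str]) -> int:
    """Count documents containing the exact phrase (case-insensitive).

    A stand-in for a quoted web search. Punctuation is not handled.
    """
    needle = " " + " ".join(phrase.lower().split()) + " "
    return sum(needle in " " + " ".join(doc.lower().split()) + " " for doc in corpus)


def shortest_unique_prefix(
    sentence: str,
    count_hits: Callable[[str], int],
    min_words: int = 3,
    max_hits: int = 1,
) -> Optional[str]:
    """Grow the phrase one word at a time until it has at most max_hits matches."""
    words = sentence.split()
    for n in range(min_words, len(words) + 1):
        candidate = " ".join(words[:n])
        if count_hits(candidate) <= max_hits:
            return candidate
    return None  # even the full sentence matched too many documents


original = "Plagiarism is a tool used by political opponents"
corpus = [
    original,
    "Some say plagiarism is a tool for lazy writers",  # invented example
    "Plagiarism is a tool used in debates",            # invented example
]
print(shortest_unique_prefix(original, lambda p: count_hits_in_corpus(p, corpus)))
```

Raising `max_hits` mirrors the choice described above of accepting a slightly shorter phrase in exchange for a couple of extra results to sort through.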
So how long should a statistically improbable phrase be? Traditionally, when I’ve made recommendations, I’ve suggested that they be 6-12 words long. The reason is that the odds of getting a good SIP, even at random, are pretty high if you select almost any 6-12 words in a page or document.
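If you want to try the "almost any 6-12 words" approach, a short script can pull candidates for you. This is only a sketch: the function name and defaults are my own, and it ignores sentence boundaries, so a candidate may straddle two sentences.

```python
import random
from typing import Optional


def random_candidate(
    text: str,
    min_words: int = 6,
    max_words: int = 12,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick a random run of consecutive words from the text to test as a SIP."""
    words = text.split()
    if len(words) < min_words:
        raise ValueError("text is too short to yield a candidate")
    rng = rng or random.Random()
    length = rng.randint(min_words, min(max_words, len(words)))
    start = rng.randint(0, len(words) - length)
    return " ".join(words[start:start + length])


sample = (
    "Plagiarism is a tool used by political opponents, like looking for "
    "inconsistent voting records or trying to discover personal indiscretions"
)
print(random_candidate(sample))
```

Each candidate still needs to be searched for in quotes. The script just saves you from picking the words by eye.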
However, with care and effort, it can be much shorter. In fact, Amazon uses SIPs to help people find unique portions of books and, though their SIPs don’t have to be truly unique, they are typically just two words long.
In short, while 6-12 word SIPs are easy to obtain, it’s possible to find a valid one that’s 3-5 words long with a bit of work. However, it may not be worth your time as you can easily use multiple SIPs that are longer and get much the same result.
Still, as long as your SIP has at least three words that are not common and are unique, or at least rarely found elsewhere, there’s no reason why you can’t use a shorter one if you feel it would help. The best way to find out is to simply test and retest until you have something you’re happy with.