It’s almost a daily occurrence in the blogosphere. Someone, somewhere, makes a post about Copyscape. Many display the Copyscape banner, others tell anecdotes about how Copyscape helped them find a plagiarist, while still others write glowing reviews of the service.
I too have written about Copyscape and my mixed feelings on it. However, as I was reading a recent spate of posts on Copyscape, I found myself wondering if Copyscape was even relevant to bloggers or if its reliance on the Google database was making it a dinosaur that was rapidly being out-evolved by the changing ways of the Web.
So, I set out to conduct a few tests and discovered that, despite its limitations, Copyscape might be of use to bloggers, especially when combined with other tools. However, it still leaves an important gap in plagiarism detection, one that it is simply unable to fill with its current system.
The Potential Problem
While there’s little doubt that Copyscape’s ability to compare two Web pages and find duplicated content is impressive, its limitation may very well be its back end. Copyscape relies on the Google database which, while probably the most complete database of Web sites on earth, is updated only at irregular intervals.
For static content, that’s fine. Plagiarism is detected when the infringing site is indexed. It doesn’t matter if that’s a few hours or a few weeks, the content will still be there and it will still be relevant.
With dynamic content, such as blogs, time is a luxury that cannot be so easily afforded. Blog content changes daily, sometimes several times a day, and, though the content is usually stored in a permanent location, if the theft goes undetected for long, the damage, as the saying goes, might already be done.
Worse still, the way one uses Copyscape, by simply inputting the site address, could prevent blog plagiarism from being detected. Since few blogs keep content older than a week on their home page, if an infringing site is not indexed at least that often, no match will be found: Copyscape will be comparing an outdated copy of the infringing site to a recent copy of the original one.
The question then becomes simple: Does the Google database update frequently enough to help bloggers detect plagiarism? The answer, sadly, isn’t simple or direct.
Testing the Database
To test Copyscape’s usefulness, I punched in the addresses of blogs that I frequent. Some were top-ten blogs, others were mid-range and still others were virtually unknown. All in all, I tried about a dozen sites and noticed a pattern: if there was content duplication taking place, Copyscape did pick it up, albeit somewhat behind the cutting edge.
In most cases, Copyscape was able to pick up duplication that was three days old. For most blogs, this meant that the posts from the latter half of the index page were being picked up while the newer ones were not. While this meant that detecting plagiarism of a "hot topic" was almost impossible, at least until after much of the excitement had passed, it did mean that RSS scrapers and other incidents of ongoing plagiarism were easily detected.
From there I began to test static sites using Google to see how frequently they were indexed. The results varied wildly from a few days to several weeks. This could, theoretically, mean that static sites would have an easier time hiding plagiarism than dynamic ones, which seem to get reindexed more often. However, since plagiarism of blogs by static sites is relatively rare and the timetable still seems to favor detection, this is a minor concern.
However, these tests also exposed something else: a fatal flaw with Copyscape that, in many cases, will severely limit its usefulness to bloggers.
Top Ten Only
Copyscape’s algorithm is tuned to be fairly sensitive. Most see this as a good thing because it can detect plagiarism of only a few paragraphs, or even a few sentences, of work. However, in blogging, quoting is very common. If you quote another site, or another site quotes you, Copyscape will pick that up as well. And if you use a quote from a major news story and that quote is large enough, any other site that shares it will be picked up too.
This makes blogs, especially popular ones, susceptible to false positives with Copyscape. With so much content being reused with permission, Copyscape, unable to tell which uses are plagiarized and which are accepted, just lists them all.
However, the false positives would be merely annoying if it weren’t for the fact that the free version of Copyscape limits the number of results to ten.
Since even a modest blog has more than ten sites legitimately reusing some content or otherwise sharing the same text, it’s likely that Copyscape’s top ten will be nothing but false alarms. Thus, though Copyscape’s results can be an interesting way to get a look at who’s covering the same stories and quoting you, it can be practically useless at detecting plagiarism for many sites.
Of course, Copyscape isn’t alone in its inability to help bloggers. My preferred plagiarism detection scheme, Google Alerts, is also impractical for the task. Though it works perfectly for static content, especially on medium to large sites where Copyscape’s automated service, Copysentry, is financially impractical, creating an alert for every blog entry simply isn’t feasible.
In addition to the hassle of having to create an alert every time you do a post (Remember: The Google Alerts TOS forbids automatically generating alerts), the number of alerts would become unmanageable after a very short period of time. Even a blog with only a few posts a week would generate an impractical number of alerts in a year or less.
Clearly, the tools that were used to detect and fight plagiarism of static content are not adequate, at least by themselves, to address the needs of bloggers.
The Good News
Fortunately, as blog plagiarism has garnered more attention, developers have been seriously looking at ways to detect and stop it. Several tools are being developed to help combat blog plagiarism and a few have already been released.
One of the more powerful tools, the Uncommon Uses feature of Feedburner, helps detect RSS scraping, the most common kind of blog plagiarism. Also, blog search engines such as Technorati and Icerocket provide powerful searches that can help detect plagiarism in much the same way Google does now. Best of all, Technorati has an API that developers can, theoretically, hook into to develop still more powerful tools.
The end result could be that the very tools that plagiarists use to find and steal content (blog search engines, RSS feeds, pings, etc.) could become the very things that shut them down.
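To make that idea concrete, here is a minimal sketch of how a scraped feed might be detected automatically. This is an illustration only: the RSS snippets, the 50-character overlap threshold and the function names are my own assumptions, not part of Feedburner, Technorati or any other real service.

```python
# Minimal sketch: detect RSS scraping by checking whether a suspect
# feed reproduces a long verbatim run from one of your own posts.
# Thresholds and helper names are illustrative assumptions.
import xml.etree.ElementTree as ET

def feed_items(feed_xml):
    """Extract (title, description) pairs from a simple RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title", ""), item.findtext("description", ""))
        for item in root.iter("item")
    ]

def find_scraped(original_xml, suspect_xml, min_overlap=50):
    """Flag suspect items that reproduce the opening of an original post.

    A crude check: if the first `min_overlap` characters of one of your
    post bodies appear verbatim inside a suspect item, report its title.
    """
    originals = [desc for _, desc in feed_items(original_xml) if desc]
    matches = []
    for title, desc in feed_items(suspect_xml):
        for orig in originals:
            if len(orig) >= min_overlap and orig[:min_overlap] in desc:
                matches.append(title)
                break
    return matches
```

In practice you would fetch both feeds on a schedule and alert yourself on any match; the verbatim-prefix test is deliberately crude, but scrapers rarely rewrite what they steal, so even a simple check like this catches most of them.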
Tips and Ideas
Until the technology fully catches up, here are some tips to help bloggers detect plagiarism in a timely but efficient manner:
- Use Feedburner to protect your feed against undesired uses. This, when combined with Copyscape or traditional searches for your content, will catch most thieves.
- If Copyscape delivers too many false positives, consider spot-checking for your own work using one of the major blog search engines. Be sure to pick good quotes, using the maximum length the search engine will allow, and pull from areas of your story not likely to be shared innocently (meaning no third-party quotes and no catchphrases).
- If you have an article that is very popular, very well written or on a subject likely to be plagiarized, consider punching the permalink to the article into Copyscape or, better yet, setting up a Google Alert or two for the piece. Even though it’s impractical to use either for everything, key works can still be covered.
- Though it makes sense for other reasons as well, regularly check the major blog search engines for terms related to your blog. Most major blog search engines will let you set up RSS feeds to automate this process, and it can help detect sites using your work without permission.
- Finally, remember that the traditional anti-plagiarism precautions still work. A clear copyright notice, including one at the footer of your feed, is always a good idea, as is writing in a unique voice that can easily be identified as your own and, if plagiarism is a big enough problem, offering visitors an easy means to report suspected theft.
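The quote-picking advice above can be sketched in code. This is a rough illustration under my own assumptions: the 100-character cap stands in for whatever maximum query length your chosen search engine allows, and `pick_search_quote` is a hypothetical helper, not part of any search engine's API.

```python
# Rough sketch: choose a distinctive search phrase from your own post,
# skipping material inside quotation marks (likely third-party quotes).
# The 100-character limit is an assumed stand-in for a search engine's
# maximum query length.
import re

def pick_search_quote(post_text, max_len=100):
    """Return the longest sentence not wrapped in double quotes,
    truncated at a word boundary to fit the assumed query limit."""
    # Drop anything inside double quotes: likely borrowed material
    # that other sites may share innocently.
    own_text = re.sub(r'"[^"]*"', " ", post_text)
    # Split into sentences on terminal punctuation followed by spaces.
    sentences = re.split(r'(?<=[.!?])\s+', own_text)
    best = max(sentences, key=len, default="").strip()
    if len(best) <= max_len:
        return best
    # Truncate at the last word boundary inside the limit.
    return best[:max_len].rsplit(" ", 1)[0]
```

You would then paste the returned phrase, in quotation marks, into a blog search engine and scan the results for unauthorized copies.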
Blogging and other dynamic content sites bring new challenges to detecting and stopping plagiarism. Though the traditional plagiarism tools are useful in meeting these challenges, they are not able to provide the same level of effectiveness that they are with static content.
Fortunately, a combination of new and traditional tools can provide a very high level of protection, perhaps higher than traditional methods alone achieve even in their most ideal situations. Better still, new tools coming out in the next few months will most likely take the place of the traditional ones, at least for bloggers.
Though it’s potentially a very frustrating time in the plagiarism battle, the situation is still very manageable and will most likely improve shortly.
[tags]Plagiarism, Content Theft, Copyright Infringement, Copyright, RSS, Copyscape, Google, Technorati, Icerocket[/tags]