I rarely get excited about upcoming anti-plagiarism products. Most seem to be overpriced, underfeatured and virtually useless to your average blogger or Webmaster. I also rarely feel the need to tread on the territory of such notable sites as Techcrunch and Mashable by covering upcoming Web 2.0 startups.
However, Blogwerx is an exception to both rules.
Blogwerx’s main product, Sentinel, which is still under development, has the potential to forever change the way blogs detect plagiarism and content theft by automatically checking for duplicate content and reporting its findings back to the site’s creator.
It’s a powerful tool that, if it works as planned, could easily change the plagiarism game for good.
What Sentinel Does
The basic idea behind Sentinel is pretty simple. Take your RSS feed along with archived entries from it, compare that content to that of other RSS feeds available on the Web and point out any large blocks of matching text. This then allows the blog owner to investigate any similarities between their feed and others for potential misuse.
According to Blogwerx, its matching algorithm works similarly to those used by Copyscape or Turnitin and is able to match partial blocks of text. This not only helps stop plagiarists who steal only a paragraph or two of writing, but also lets you see where you’re being quoted or otherwise legitimately reused. It can also help you find other sites that share the quotes and other information you wrote about, letting you see who’s talking about similar issues in a way most search engines cannot.
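Blogwerx has not published how its matching works, so the following is only a rough sketch of one common way to find partial blocks of shared text: comparing overlapping runs of words, often called shingles. The function names, the shingle size and the sample strings are all assumptions made for illustration, not anything taken from Sentinel.

```python
import re


def words(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(text, size=5):
    """Return the set of overlapping word runs ("shingles") found in the text."""
    tokens = words(text)
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def overlap(mine, theirs, size=5):
    """Fraction of my shingles that also appear in the other text (0.0 to 1.0)."""
    my_shingles = shingles(mine, size)
    if not my_shingles:
        # Text shorter than one shingle: nothing to compare.
        return 0.0
    return len(my_shingles & shingles(theirs, size)) / len(my_shingles)


# Hypothetical feed entries: the second one lifts a sentence from the first.
original = "The quick brown fox jumps over the lazy dog near the quiet river bank"
suspect = "Intro text here. The quick brown fox jumps over the lazy dog. Some more text."
score = overlap(original, suspect, size=4)
```

A high score would only flag an entry for a human to look at. As noted above, a match can just as easily be a legitimate quote as a theft.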
These checks will take place at regular intervals. Free users will have to wait the longest, up to two weeks between checks, while paying customers will get more frequent checks based upon their account level. After the checks are done, the user will then be able to view the results and follow up on any similarities that interest them.
In short, it’s a powerful system that has a variety of uses beyond just plagiarism fighting. However, the most interesting and potentially useful feature lies in the search algorithm itself and how the Blogwerx team gave Sentinel not just a pair of eyes, but also a brain.
Since Sentinel, when parsing RSS feeds, ignores all punctuation and most extremely short words, it can easily see through most simple text manipulations such as restructuring sentences and introducing false paragraph breaks. However, Blogwerx took things a step further and built a thesaurus into Sentinel’s algorithm, making it capable of detecting copies that have been rewritten in minor ways and, potentially, even articles that have been "spun" by synonymizing software.
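To make the idea concrete, here is a minimal sketch of that kind of normalization, assuming a tiny hand-made thesaurus. The SYNONYMS table, the normalize function and the three-letter cutoff are all hypothetical. Sentinel's real word list and rules are not public.

```python
import re

# Hypothetical toy thesaurus: each word maps to one canonical synonym.
SYNONYMS = {
    "large": "big",
    "huge": "big",
    "rapid": "fast",
    "quick": "fast",
    "automobile": "car",
}


def normalize(text, min_length=3):
    """Lowercase, drop punctuation and very short words, then canonicalize synonyms."""
    # Keeping only letters and digits discards punctuation and line breaks.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Words shorter than min_length are ignored (an assumed cutoff).
    kept = [token for token in tokens if len(token) >= min_length]
    # Replace each word with its canonical synonym when one is known.
    return [SYNONYMS.get(token, token) for token in kept]


original = normalize("The quick, huge automobile!")
spun = normalize("The rapid\n\nlarge car.")
```

Both sentences reduce to the same list of words, so a comparison run on the normalized text would treat the "spun" version as a copy even though only two of its words match exactly.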
If this works as planned, it will put Sentinel a generation ahead of other plagiarism searching techniques, most of which require the use of a "dumb" search engine that only detects exact matches.
A drawback to the synonym checking feature of Sentinel is that, most likely, it will not be available to users of the free product. Blogwerx’s programmers have been able to add the feature to the service without hurting speed (the latest version of the software can process up to ten million feeds per day, with the potential for many more as the service expands), but the added burden still prohibits it from being freely available at this time.
However, if early signs from Blogwerx are any indication, the paid versions of its service will begin at approximately five dollars a month, making it comparable to Feedburner and the most basic versions of Copysentry, the paid version of Copyscape.
Three Strikes; You’re A Splog
One of the more interesting side features to Sentinel is the ability for users to mark infringing blogs as spam blogs (splogs). After three strikes of confirmed plagiarism, the blog is officially listed as a splog and moved into a database that will be publicly available via an API.
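The three-strikes rule is simple enough to sketch in a few lines. This is only an illustration of the rule as described here: the SplogRegistry class, its method names and the example URLs are made up, and nothing below reflects the actual Blogwerx database or API.

```python
from collections import defaultdict

STRIKE_LIMIT = 3  # the article's threshold: three confirmed strikes


class SplogRegistry:
    """Hypothetical in-memory stand-in for the splog database."""

    def __init__(self, limit=STRIKE_LIMIT):
        self.limit = limit
        self.strikes = defaultdict(int)  # blog URL -> confirmed strikes
        self.splogs = set()              # blogs that reached the limit

    def confirm_plagiarism(self, blog_url):
        """Record one confirmed strike; return True once the blog is listed."""
        self.strikes[blog_url] += 1
        if self.strikes[blog_url] >= self.limit:
            self.splogs.add(blog_url)
        return blog_url in self.splogs

    def is_splog(self, blog_url):
        """Check whether a blog has been listed as a splog."""
        return blog_url in self.splogs


registry = SplogRegistry()
results = [registry.confirm_plagiarism("http://scraper.example/") for _ in range(3)]
```

Here the first two reports leave the blog unlisted and the third one lists it, which is the point at which a public lookup would start reporting it.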
This could, potentially, be used to create applications that work to prevent scraping or aid search engines in blacklisting useless blogs. It can also make an excellent addition to other splog databases, such as SplogSpot, that work to catch all junk blogs but may not spot outright plagiarized blogs.
The Need for Sentinel
Many will point out that much of Sentinel’s technology overlaps with already existing products. Copyscape already provides algorithmic matching of text, Feedburner already helps detect RSS scraping and Google Alerts can provide automated checking for duplicate content.
It seems, on the surface at least, that much of Sentinel’s role has already been filled. However, the reason I am excited about Sentinel isn’t that it can replace those services, but that it can fill in the holes they leave behind.
First off, as I discussed previously, Copyscape is ill-targeted at bloggers. Its reliance on the Google database gives it only limited usefulness in the rapid-fire world of blogging. Delays in updates to the Google database blunt its effectiveness. Also, since many splogs and scrapers are blacklisted from Google, some of the worst offenders may not show up in Google at all.
Though Sentinel will be limited in that it will only check for plagiarism once every so often, its checks will be for the content immediately available, using RSS feeds to pull the latest versions of all blogs. Also, Copyscape searches are not automated and will only provide the top ten results. Considering the high rate of false positives with the service, that could leave the vast majority of misuse undetected unless you pay for the Copysentry service, which only protects ten pages at the most basic level.
Google Alerts, while automated, share many of the same problems. Also, setting up a GA for each blog entry is a time-consuming process that doesn’t mesh well with the nature of blogging. Since the automatic generation of Google Alerts is prohibited at this time, there’s no way to integrate GAs into blogging products. Also, GAs only (reliably) detect full-fledged copy and paste jobs and have no ability to detect partial reuse and/or modified content.
Finally, Feedburner, though providing valuable feed statistics and some impressive tools to deal with RSS scraping, has severe limitations. Since it can only detect reuse of your feed, it’s possible for scrapers to grab from other sources, such as Technorati watchlists and your site’s original, unprotected feed, without detection. I’ve noticed at least a few sploggers scrape some or all of my content without Feedburner noticing.
In that regard, Feedburner might be seen as a complement to Sentinel. Feedburner detects most traditional scraping immediately and Sentinel, hopefully, will be able to pick up the rest.
The bottom line is that, while Sentinel may overlap existing technologies, it also fills gaping holes that they’ve left behind, offering a layer of protection unlike anything bloggers have seen before.
In the end, bloggers will have to decide whether or not they want to use Sentinel. However, since the basic version of Sentinel will be free, there will be little reason not to.
If it turns out the way it appears it will, it will likely serve the merely curious, the protectors of copyright and the copyleft crowd alike. Anyone who is remotely interested, for any reason, in where their content is being reused will likely find something to smile about when using Sentinel.
But in terms of pure copyright protection, Sentinel will likely be very hard to beat, both for its brain and for its immediacy. It will be very interesting to see if and how Sentinel affects content reuse, both legitimate and illegitimate, after it is released.
Until then though, anyone who is interested in Sentinel should visit the Blogwerx site and add their email address to the list of potential beta testers.
It should be a very interesting launch.
(Disclaimer: I have done some unpaid consulting for Blogwerx in the past. I am not a member of the Blogwerx staff but instead volunteered my time because I was interested in the idea behind the service and felt that it had a great deal of potential to help in the fight against plagiarism.)