SoloSEO, an SEO company from Utah, announced on their blog that they are starting a database of sites that plagiarize content and are asking other Webmasters to contribute to it.
The idea is that, in addition to shaming and drawing attention to sites that steal, the list can eventually be used, in conjunction with plugins, to help blacklist sites that engage in such behavior. Ideally, it could be used to blacklist trackbacks and pings from a given site and, possibly, to keep such sites from being listed in search engines.
Michael Jensen, the creator of the list, said he hopes the list will be used to “point out the people who dirty the web on the backs of smart and talented individuals” and, in the process, make the Web a little bit more free from content theft and spam blogs.
However, though these are lofty and noble goals, the service, as it stands, has a long way to go before it becomes as useful as it is intended to be.
What it Is
As of this writing, the list is nothing impressive. It is a collection of fifteen pages across nine domains, all reported by Jensen himself, that have been accused of scraping. It also has a form in which a user can paste an infringing page, an original page and their name (though anonymous posting is allowed).
One interesting feature of the site is that, next to each domain, there is a link to the site’s whois information. However, this information is often wrong or anonymous. In cases of subdomains, such as with blogspot.com accounts, the information points to the original company, not the plagiarist. It also doesn’t seem to work with all registrars, forcing users to visit a second or third page to get the full information.
Still, the feature is potentially useful with top-level domains and using a different whois checker, such as my favorite Domain Tools, might produce better results.
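The subdomain problem above is inherent to how whois works: lookups operate on the registrable domain, so any subdomain collapses to the registrant of the parent. A minimal sketch of that collapse (the two-label rule here is a deliberate simplification; real code should consult the Public Suffix List, since suffixes like co.uk have more than one label):

```python
# Sketch: why whois on a blogspot.com subdomain returns the parent
# company's registration, not the plagiarist's details.
def registrable_domain(hostname: str) -> str:
    """Naively reduce a hostname to its registrable domain (last two labels)."""
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])

# A whois query for this subdomain effectively asks about blogspot.com:
print(registrable_domain("someplagiarist.blogspot.com"))  # blogspot.com
```

This is why the whois link is only potentially useful for scrapers on their own top-level domains, as noted above.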
All in all though, the current list is not much to look at. The promise, however, is more in what it could be molded into later.
What it Could Become
Jensen is actively seeking help in turning this list into something greater. Specifically, he is seeking out WordPress plugin authors who might be willing to create a reporting system that submits duplicate content trackbacks to the list at the click of a mouse. This might also be a natural addition to MaxPower’s Digital Fingerprint plugin, which helps detect scraped content on the Web.
Another interesting idea might be to create a Firefox extension that can be used to report plagiarist sites that turn up during either traditional searches or via Google Alerts. This would be a great help to Webmasters that don’t use WordPress or monitor their content through other means.
In addition to making input easier, the output could be made more useful by adding plagiarist sites to various blacklists, including ones that block trackbacks and pings. Akismet, the anti-comment spam service, could potentially benefit from this kind of information, as could blog search engines such as Technorati and Icerocket. Also, major pinging services, such as Weblogs.com, would be able to avoid passing on pings for known scrapers.
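A blog or pinging service consuming such a blacklist might work roughly like this. Everything here is an assumption for illustration: the list has no published API or data format, so the set of blocked domains and the parent-domain matching rule are hypothetical.

```python
# Hypothetical sketch: reject trackbacks/pings whose source domain (or any
# parent domain) appears on a scraper blacklist. In practice the set would
# be fetched periodically from the shared list rather than hardcoded.
from urllib.parse import urlparse

BLACKLIST = {"known-scraper.example", "splog.example"}  # assumed format

def accept_trackback(source_url: str) -> bool:
    """Return False if the trackback's host matches a blacklisted domain."""
    host = (urlparse(source_url).hostname or "").lower()
    labels = host.split(".")
    # Check the host itself and every parent domain against the blacklist.
    return not any(".".join(labels[i:]) in BLACKLIST for i in range(len(labels)))
```

Checking parent domains as well means a blacklisted splog.example also blocks pings from foo.splog.example, which matters given how freely scrapers spin up subdomains.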
All in all, there is a lot of potential for this list or one like it. However, before it can achieve its full potential, there are several problems that will have to be overcome.
Problems with the Service
Though the idea is very interesting and has a lot of exciting potential uses, SoloSEO’s list has a long way to go before it makes a major impact on the Web.
First, as of right now, there is no verification of new entries. The user submits an infringing URL, an original one and their name (if they wish) and the post is added. Where a DMCA notice requires full contact information, a statement under penalty of perjury and other safeguards, a notice to this list is automatic and anonymous. There is no protection against false information and abuse.
The only safeguard that is offered is an email address to contact in the event that false information is posted. However, this type of “after the fact” resolution would be small comfort if this, or any similar list, were to play a vital role in stamping out duplicate content on the Web. Such a system would almost certainly be abused to list sites that were either unpopular or held controversial views, effectively chilling a portion of their speech.
Second, the public nature of this list is disturbing. I’ve written before about the Shame Game and the reasons why mob justice doesn’t work. Shame can be a deterrent, but it also leads to other problems including defamation issues, vigilantism and inaccurate information. Not only can such a list be abused by plagiarists to turn their infringement around on the original authors, but it is easy to make a simple mistake when pasting links and accidentally blacklist yourself.
These drawbacks almost always outweigh the usefulness of going public with tales of plagiarism.
Third, spammers and scrapers are able to create new accounts far faster than they can be reported. In the time it takes a user to report a scraper, the software that made the spam site can make a dozen more. This makes such a service a matter of chasing one’s own tail, always a few steps behind the worst offenders.
Finally, such a list does nothing to get the content removed. Where a DMCA notice or a cease and desist letter might get the site removed from the Web, or at least the search engines, this list simply shames the plagiarist a bit and, possibly, directs other sites to ignore it. The content still remains.
However, despite these issues, there is at least some hope for this list, provided it is able to evolve in a slightly different direction.
A New Direction
The idea behind this list is sound and it might be able to play a role in stopping or at least deterring plagiarism on the Web. However, before that happens, there will have to be several changes:
- Better Accountability: The site needs to make its users more accountable for what they say. Anonymous posting is enticing, but also very dangerous and ripe for abuse. Having users register for an account before reporting plagiarists would help, as it would make it easier to ban users who are obviously abusing the system.
- Verified Information: Rather than simply accepting that the information given is valid, the list needs to do at least some rudimentary evaluation. Granted, this is a difficult task, but it might be an area that Blogstamp can help with. (previous coverage)
- Private Information: Making the list publicly available and exposing personal information opens the site up to libel suits and deters only a small percentage of plagiarists. It would be best if this information were hidden from public view, though perhaps included in private blacklists, until it can be independently verified. In that regard, Blogwerx has an interesting approach to dealing with this issue.
- Better Resolution: Rather than just shotgunning the information out there, it would be better if the site worked, in some manner, as an intermediary to resolve the issues. Offering a means for the accused plagiarist to contact the original author would be a step in that direction.
- Greater Cooperation: Clearly, as Jensen has stated, the service is going to need better cooperation both getting material in and making use of what’s there. There’s the potential to create a goldmine of information here that can help deal with scraping and plagiarism, but tools will need to be created on both sides to make that happen.
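The “Verified Information” step above could at least be approximated automatically. One rough approach, offered purely as a sketch and not as anything the service actually does, is to measure word-level n-gram (“shingle”) overlap between the reported original and the allegedly infringing page; page fetching and the threshold value are assumptions:

```python
# Sketch of a rudimentary duplicate-content check: compare word 5-gram
# overlap between the two submitted pages before accepting a report.
def shingles(text: str, n: int = 5) -> set:
    """Return the set of word-level n-grams in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def likely_copy(original: str, suspect: str, threshold: float = 0.5) -> bool:
    """Flag the pair as a likely copy if enough shingles are shared."""
    a, b = shingles(original), shingles(suspect)
    if not a or not b:
        return False
    overlap = len(a & b) / min(len(a), len(b))
    return overlap >= threshold
```

A check like this would not catch rewritten or heavily excerpted theft, but it would filter out the most obviously bogus reports before they ever reach the public list.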
If those steps were taken, it would not be an ideal service and there would still be more progress to be made, but the service would at least be well on its way.
There have been other attempts to create blacklists of spam blogs and scrapers, Splogspot being one example, but most have flamed out (including Splogspot itself apparently). The difficulty in creating a useful list, maintaining it and funding it are almost always too much to bear. Furthermore, with spammers creating sites faster than they can be reported, the usefulness of any such list is dubious.
Overall, it appears that the key lies in hosts keeping spam blogs and scrapers off of their servers, as WordPress.com seems to have done, and shutting down the networks at their source by cutting off funding.
However, a list such as this one might be useful in reducing trackback spam and giving search engines forewarning of a spam blog in operation. That could not only mitigate the duplicate content penalty, but also prevent scrapers from gaming the blog search engines.
What remains to be seen though is if this new list will be widely used by other bloggers or remain Jensen’s personal “Hall of Shame” style site.
If it takes off and is well supported, it may become a valuable tool. Otherwise, it may just remain a personal site used to out plagiarists of Jensen’s own literary efforts.
Either way, it would be interesting to see if such an effort could be useful and what impact it would have.