If reading this site and enduring the onslaught of content theft, plagiarism and scraping is making you think about packing up and shutting down your site, you might want to think again.
These days, even death does not put an end to content theft. It merely opens up new avenues for it.
As a recent article on Blue Hat SEO pointed out, nothing is ever really deleted from the Web.
Caching sites and archives hold on to your content long after the page has been removed and, as the article demonstrates, anything that is available can be scraped.
It is the equivalent of picking a dead man’s pocket, but it is a type of plagiarism that can and does happen. It is also a kind of plagiarism that raises a whole slew of new questions and concerns.
How it is Done
The process for scraping a dead site is surprisingly simple, involving only five steps.
- Visit an archiving site such as The Internet Archive
- Look up an old site that you know to be deceased
- Find an old, but still relevant, article
- Check to make sure that the article has not been posted elsewhere
- Copy/paste it onto your site
As the original article points out, there are ways to automate this process, such as using sitemaps, but the process clearly works best by hand. With that in mind, this method is unlikely to be used by professional spam bloggers, who need more content than this method could provide, but could be favored by human-powered spam sites that are trying to appear legit.
To those Webmasters, the unethical ones catering to humans and the search engine bots, this type of content theft can seem like a dream come true, especially when one weighs the advantages that come with it.
The Advantages for Scrapers
When looking at why a scraper or a plagiarist would find this kind of content appealing, consider the following:
- Since the content has been removed from the search engines, there is no possible duplicate content penalty and no original site to compete with in the search engine rankings.
- If the original site is shut down, most likely, the people charged with protecting the content are no longer looking for infringements. Thus, the odds of finding legal trouble are slim to none.
- Finally, there is a wealth of dead information out there, ready to be copied. Though the Web is growing, sites are also dying every day, thus ensuring a steady stream flowing into a large, growing pool.
Of course, taking advantage of such a system requires a certain amount of cleverness and work ethic, things that are in short supply among most who would steal content, but if one sees the advantages as outweighing the drawbacks, it would still be an appealing option.
However, given the wide range of legitimate, easily-accessed content available to scrape, it seems unlikely that most infringers would take this route.
Despite that, there is little doubt that at least some do scrape dead sites for at least part of their content and that fact raises some difficult questions that don’t come with easy answers.
Questions to Ponder
Obviously, given the current length of copyright terms, any such scraping would likely be an infringement. Unless the work was placed under a Creative Commons license or donated to the public domain, the work is almost certainly still protected and the copying is illegal.
However, with the business interest in protecting those works gone, any copyright is likely to go unenforced. Most won’t care to check for any infringement and those who do will likely be unaware of the potential danger. Most feel that, when the site is removed, the risk of plagiarism disappears.
Worse still, the authors of the works, who often created them as works for hire and hold no copyright interest in them, would be the ones holding the greatest continued stake. Though their employers have long since closed up shop, their names and reputations remain affixed to the works.
But even if the copyright holder does detect such post-mortem plagiarism and expresses an interest in defending those works, going through the motions could be difficult. Though the DMCA does not state that URLs must be provided with a DMCA notice, some hosts have that requirement and it is the most common way of identifying the original works. With the site down, that element could prove difficult.
Furthermore, since there would be no economic damages and it is unlikely that the work was registered, suing for said infringement would be almost financially impossible. In short, though stopping such plagiarism would be possible, it would be more difficult and less worth the time spent.
It may not be the perfect crime, but it is certainly pretty close. It’s very hard to envision a scraper paying for such an infringement, no matter how unethical or illegal it is.
Prevention is the Key
Since enforcing the copyrights of a dead site is, in many cases, impractical, the best approach is to look at preventing the infringement from happening in the first place. Doing that involves having a sound shut-down strategy that includes the following steps:
- Handle all existing plagiarism cases: Deal with all ongoing plagiarism and content theft cases that you can. Ensure that the content is removed completely before beginning to take content offline.
- Block known archiving sites: Edit your robots.txt file to exclude the Internet Archive and other archiving services. All cached copies of your site should be removed within a few days. (Note: You can skip this step in favor of step three but it is worth being certain that these archiving sites drop your cached copies.)
- Block All Search Engines: Since search engine caches can remain active for some time after a site goes down, once you are certain the archive sites have removed the work, edit the robots.txt file to exclude all search engines. You can also use meta tags to prevent indexing, or use both methods to ensure that all spiders stop indexing your site.
- Move Content to a Hidden Location: Move a copy of your site to a hidden, but accessible, location. Consider making the site password protected to ensure that it cannot be indexed or visited by anyone you do not personally authorize.
- Remove the Original Site: Take down the site and close up shop, being certain to leave behind a means of contact in the event a problem is discovered later.
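The two robots.txt steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `ia_archiver` is the user-agent token historically associated with the Internet Archive's crawler, the example URL is a placeholder, and `is_blocked` is a hypothetical helper that uses Python's standard `urllib.robotparser` to check what a given rule set would tell a well-behaved spider. Whether a particular archive or search engine actually honors these rules is something to verify with that service.

```python
from urllib.robotparser import RobotFileParser

# Step two: exclude only the archive crawler.
# "ia_archiver" is assumed here to be the token the archive's crawler obeys.
ARCHIVE_ONLY = """\
User-agent: ia_archiver
Disallow: /
"""

# Step three: exclude every compliant spider, search engines included.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Optional meta tag for each page, asking spiders not to index or cache it.
META_TAG = '<meta name="robots" content="noindex, noarchive">'


def is_blocked(robots_txt, user_agent, url="http://example.com/article.html"):
    """Return True if the given robots.txt text forbids user_agent from url.

    The default url is a placeholder, not a real page.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_blocked(ARCHIVE_ONLY, "ia_archiver"))  # archive crawler
    print(is_blocked(ARCHIVE_ONLY, "Googlebot"))    # search engine, step two
    print(is_blocked(BLOCK_ALL, "Googlebot"))       # search engine, step three
```

Checking the rules locally before uploading them is a cheap way to catch a typo that would otherwise leave the site open to the very spiders you are trying to turn away.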
Though the process seems lengthy and difficult, it can easily be done over a couple of days with most of the time spent waiting on the spiders to re-index your site. Once they do, the cached copies should be dropped almost immediately.
The key thing in all of it is to ensure that the cached copies of your work are removed BEFORE you take down the site. As long as you ensure that the major caching services, especially the Internet Archive, have removed your work before you shut down, you can greatly reduce the risk of post-mortem plagiarism.
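One way to confirm that the Internet Archive no longer holds a copy before shutting down is to ask it directly. The sketch below assumes the Wayback Machine's public availability endpoint at `archive.org/wayback/available` and the JSON shape it has returned in the past (an `archived_snapshots` object containing a `closest` entry when a copy exists). The function names are hypothetical, and the response format should be checked against the current documentation before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for the Wayback Machine availability lookup.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def has_snapshot(response):
    """Return True if a decoded availability response reports an archived copy."""
    closest = response.get("archived_snapshots", {}).get("closest")
    return bool(closest and closest.get("available"))


def check_archive(site_url):
    """Query the availability endpoint for site_url (requires network access)."""
    query = urlencode({"url": site_url})
    with urlopen(f"{AVAILABILITY_ENDPOINT}?{query}", timeout=10) as reply:
        return has_snapshot(json.load(reply))
```

Calling `check_archive("example.com")` with your own domain in place of the placeholder, and waiting until it returns `False`, gives a simple signal that it is safe to move on to taking the site down.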
The good news is that, though there is no way to tell how common this kind of content theft is, it almost certainly is very rare. In most cases, the meager rewards do not justify the effort required. Also, when it does happen, it will largely be limited to the larger, better-known sites that are now defunct, as it requires at least some advance knowledge of the site’s existence.
The bottom line is that the content on any Web site, even a closed one, still has value. This is especially true if the rightsholder is looking to start up another venture in the future or is pursuing off-line methods of distribution. The fact that the site is down does not mean it is acceptable to exploit the effort and expense of the original author.
Perhaps the strangest side effect of this is the damage it does to the classic scraper mantra, “If you don’t want your content scraped, get off the Web.” If shutting down a site does not put a definite end to content theft, then dealing with plagiarism on a live site becomes an even more practical solution.
In the end, it seems that the only surefire way to avoid scraping is to not put your content on the Web in the first place. As it sits right now, once the content is on the Web, there is no way to erase it.