<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>robots.txt | Plagiarism Today</title>
	<atom:link href="http://www.plagiarismtoday.com/tag/robotstxt/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.plagiarismtoday.com</link>
	<description>Content Theft, Plagiarism, Copyright Infringement</description>
	<lastBuildDate>Mon, 13 Feb 2012 06:51:37 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Blocking Bad Bots with Robots.txt</title>
		<link>http://www.plagiarismtoday.com/2010/06/10/blocking-bad-bots-with-robots-txt/</link>
		<comments>http://www.plagiarismtoday.com/2010/06/10/blocking-bad-bots-with-robots-txt/#comments</comments>
		<pubDate>Thu, 10 Jun 2010 17:47:33 +0000</pubDate>
		<dc:creator>Jonathan Bailey</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[bots]]></category>
		<category><![CDATA[chattels]]></category>
		<category><![CDATA[Content-Theft]]></category>
		<category><![CDATA[Copyright]]></category>
		<category><![CDATA[Copyright-Infringement]]></category>
		<category><![CDATA[Copyright-Law]]></category>
		<category><![CDATA[Plagiarism]]></category>
		<category><![CDATA[robots.txt]]></category>
		<category><![CDATA[spiders]]></category>
		<category><![CDATA[trespass to chattels]]></category>

		<guid isPermaLink="false">http://www.plagiarismtoday.com/?p=6837</guid>
		<description><![CDATA[Using Robots.txt to block bad bots may not be perfect, but it is incredibly fast and there are two tools to make it even easier. ]]></description>
			<content:encoded><![CDATA[<p><img style=' float: left; padding: 4px; margin: 0 7px 2px 0;'  src="http://www.plagiarismtoday.com/wp-content/uploads/2010/06/clickability-logo.jpg" alt="" title="clickability-logo" width="315" height="68" class="alignleft size-full wp-image-6838"></p>
<p>When it comes to things crawling your site, there are good bots and bad bots. Good bots, like Google&#8217;s spider, crawl your site to index it for search engines or provide some other symbiotic use. Others spider your site for more nefarious reasons such as stripping out your content for republishing, downloading whole archives of your site or extracting your images. </p>
<p>The question is simple. How do you block the bad bots while welcoming the good ones in? Fortunately there is a standard, <a href="http://www.webconfs.com/what-is-robots-txt-article-12.php">robots.txt</a>, that can do just that. </p>
<p>However, editing your robots.txt file can be a daunting task, even for experienced Webmasters, as it requires mastering a very specific and somewhat unusual format. To make matters worse, even those who understand how such files work rarely know the names of all the bots to let in or block, limiting what they can do with it.</p>
<p>Fortunately, there is both a WordPress plugin and a robots.txt generator that can help. What is unclear, however, is if they will actually work.<span id="more-6837"></span></p>
<h4>A Quick Word About Robots.txt</h4>
<p>The basic idea behind robots.txt is that it is meant to be a guide for robots (or spiders, as they are often called) when they visit your site. It tells a bot which pages and directories it can and cannot visit, and it can set up either blanket instructions for all bots or instructions for specific bots.</p>
<p>This is done by, appropriately, including a file titled robots.txt in the root of your server. For example, Plagiarism Today&#8217;s (deliberately open) robots.txt file can be found at <a href="http://www.plagiarismtoday.com/robots.txt">http://www.plagiarismtoday.com/robots.txt</a>.</p>
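<p>As a rough illustration of the format, the sketch below feeds a small, made-up robots.txt to Python&#8217;s standard <code>urllib.robotparser</code> module and asks which visits it permits. The bot name ExampleBadBot and the example.com URLs are placeholders, not real crawlers or pages.</p>

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: one record for a specific (hypothetical) bot,
# followed by a blanket record for every other bot.
SAMPLE = """\
User-agent: ExampleBadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(text):
    """Parse robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(SAMPLE)

# The named bot is told to stay out of the whole site.
print(parser.can_fetch("ExampleBadBot", "http://www.example.com/articles/"))  # False

# Any other bot falls under the blanket record.
print(parser.can_fetch("Googlebot", "http://www.example.com/articles/"))  # True
print(parser.can_fetch("Googlebot", "http://www.example.com/private/notes.html"))  # False
```

<p>Each record starts with a User-agent line naming the bot it applies to, and the Disallow lines beneath it list the paths that bot is asked to avoid.</p>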
<p>Since robots.txt makes it possible to filter bots by their identifier, it can be used to let only certain spiders into your site while keeping others out. The problem is that there are, quite literally, hundreds of different spiders out there and new ones being written all the time. It is almost impossible to keep on top of what spiders you should allow and those that you should banish.</p>
<p>Fortunately, there are two tools that may be able to help you do exactly that by giving you the information you need to build a robots.txt file that&#8217;s capable of keeping at least some of the bad bots at bay.</p>
<h4>Building a Better Robots.txt</h4>
<p>To help you build an effective robots.txt file, there are two tools you may wish to look at:</p>
<ol>
<li><strong><a href="http://petercoughlin.com/robotstxt-wordpress-plugin/">Robots.txt WordPress Plugin</a>:</strong> Written by Peter Coughlin, this WordPress plugin largely automates the process of building a robots.txt file that can keep bad bots out while letting in the good ones. Though largely hands-free, you can tweak your own robots.txt file through this plugin, eliminating the need to access it via FTP. Also, the plugin is careful not to overwrite any existing file and will uninstall gracefully if requested.</li>
<li><strong><a href="http://www.clickability.co.uk/robotstxt.html">Robots.txt Builder</a>:</strong> Provided by David Naylor, this robots.txt builder lets you select from categories of bots, including search engines, archivers and, of course, &#8220;bad robots&#8221; and decide the level of access each group has. When done, simply paste the results from the builder into your robots.txt file and upload it to your server.</li>
</ol>
<p>The two tools are related, as Coughlin&#8217;s WordPress plugin uses Naylor&#8217;s list of bad bots to help it determine which bots to keep out. So, in that regard, the WordPress plugin is essentially an automated install of Naylor&#8217;s list for WordPress users. </p>
<p>This also means that both methods function largely the same way, by compiling a list of bad bots and, using robots.txt, telling them to keep out while throwing the doors wide open to other kinds of spiders that may wish to crawl your site.</p>
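<p>The general approach can be sketched in a few lines of Python. This is not the code either tool uses; it is a minimal illustration that assumes you already have a list of bot names to exclude. The names shown are hypothetical.</p>

```python
def build_robots_txt(bad_bots):
    """Return robots.txt text that shuts out each named bot and admits the rest."""
    records = []
    for name in bad_bots:
        # One record per unwanted bot, closing off the whole site.
        records.append("User-agent: " + name + "\nDisallow: /\n")
    # An empty Disallow value means nothing is off limits for everyone else.
    records.append("User-agent: *\nDisallow:\n")
    # Records are separated by a blank line.
    return "\n".join(records)


# Hypothetical bot names, for illustration only.
robots_txt = build_robots_txt(["ExampleScraper", "ExampleDownloader"])
print(robots_txt)
```

<p>The printed text is what you would save as robots.txt in the root of your server.</p>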
<p>However, it is unclear exactly how effective this method is and the reason is the nature of the bad bots themselves.</p>
<h4>Limitations and Concerns</h4>
<p>The problem with using robots.txt in this way is that, for it to work, it requires the cooperation of the robots themselves. Robots.txt doesn&#8217;t do anything to actually restrict bots from accessing various parts of your site; it merely tells them where you wish to allow them to go or not go. In short, if a bot wants to ignore your robots.txt guidelines, it can.</p>
<p>Legitimate bots obey robots.txt for a variety of legal and ethical reasons. Googlebot, for example, will always adhere to your robots.txt rules. Bad bots, however, are free to ignore them and often do. </p>
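<p>To make the voluntary nature of the standard concrete, here is a minimal sketch of the decision from the crawler&#8217;s side. The function name <code>plan_crawl</code>, the <code>obey_rules</code> switch and the bot name are all invented for this illustration. The point is that the check happens inside the bot, not on your server.</p>

```python
from urllib.robotparser import RobotFileParser


def plan_crawl(robots_text, user_agent, urls, obey_rules=True):
    """Return the URLs a crawler would request, given a site's robots.txt."""
    if not obey_rules:
        # A bot that ignores robots.txt never consults it at all.
        return list(urls)
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


ROBOTS_TEXT = "User-agent: ExampleBadBot\nDisallow: /\n"
URLS = ["http://www.example.com/", "http://www.example.com/archive/"]

# A cooperative bot with this name requests nothing.
print(plan_crawl(ROBOTS_TEXT, "ExampleBadBot", URLS))

# The same bot, choosing not to cooperate, requests everything.
print(plan_crawl(ROBOTS_TEXT, "ExampleBadBot", URLS, obey_rules=False))
```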
<p>This doesn&#8217;t mean using robots.txt to block bad bots is completely ineffective. The reason is that many &#8220;bad&#8221; bots are actually good or neutral ones used for bad reasons. For example, there are many download spiders that can be used for good or bad reasons but, unless you have a specific reason to allow them, you should probably block them as they serve little positive use for you. Those will, by and large, obey robots.txt.</p>
<p>Also, many bad bots will obey robots.txt simply because so few sites bother to block them and there is little reason for them not to. Not following the standard can open up potential legal challenges against them, including potential copyright issues and trespass to chattels (though such a theory <a href="http://w2.eff.org/spam/20011218_eff_trespasstc_analysis.php">presents its own dangers and challenges</a>), so there is great risk in ignoring the rules and little to gain by doing so. </p>
<p>There are still other ways around this. Some nefarious bots will change their name, or even randomize it, to avoid being targeted by robots.txt. They instead gain the rights granted to all other bots not specifically listed, which are usually more liberal. Most bots that don&#8217;t want to obey robots.txt, however, will simply not do so. Unless the server takes some additional step to block them, such as <a href="http://www.plagiarismtoday.com/2007/07/02/using-htaccess-to-stop-content-theft/">using an IP or user-agent block</a>, both of which can also be circumvented, there isn&#8217;t much to stop them.</p>
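<p>The linked article covers doing this with .htaccess. As a language-neutral illustration of the same idea, the sketch below wraps a tiny Python WSGI application so that listed user agents receive a 403 response. The bot names, the <code>site</code> application and the helper <code>status_for</code> are hypothetical. It also shows the weakness noted above: a bot that changes its name walks straight through.</p>

```python
# Hypothetical name fragments to refuse, compared in lower case.
BLOCKED_AGENT_FRAGMENTS = ("examplescraper", "exampledownloader")


def block_listed_agents(app, fragments=BLOCKED_AGENT_FRAGMENTS):
    """Wrap a WSGI app so requests from listed user agents get a 403."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(fragment in agent for fragment in fragments):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def site(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Welcome\n"]


def status_for(agent):
    """Send one fake request through the wrapped app and report the status."""
    seen = []

    def start_response(status, headers):
        seen.append(status)

    block_listed_agents(site)({"HTTP_USER_AGENT": agent}, start_response)
    return seen[0]


print(status_for("ExampleScraper/1.0"))  # 403 Forbidden
print(status_for("TotallyNewName/1.0"))  # 200 OK: the renamed bot gets in
```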
<p>One other possibility is to <a href="http://www.plagiarismtoday.com/2006/07/17/cloaking-to-stop-scraping/">use cloaking to trick scrapers into grabbing the wrong content</a>,  but that possibility also carries its own risks with it. </p>
<h4>Bottom Line</h4>
<p>Given the fact that there is very little risk with manipulating your robots.txt file to block bad bots, there is very little reason to not do it. It may not block all or even most, but it can block some and that can make it of some small benefit.</p>
<p>The problem is that, if more webmasters start blocking bad bots via robots.txt, the method will become less effective as more and more decide the risk of ignoring it is worthwhile. However, that would require a very large number of webmasters to engage in the practice, which is unlikely considering that many sites don&#8217;t even have control over their robots.txt file.</p>
<p>In short, if you can do it, you probably should. It&#8217;s an extra layer of protection against those who might wish to misuse your site or your content. It is not much, but certainly every little bit helps. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.plagiarismtoday.com/2010/06/10/blocking-bad-bots-with-robots-txt/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Pop Quiz 1: Google Answers</title>
		<link>http://www.plagiarismtoday.com/2008/10/07/pop-quiz-1-google-answers/</link>
		<comments>http://www.plagiarismtoday.com/2008/10/07/pop-quiz-1-google-answers/#comments</comments>
		<pubDate>Tue, 07 Oct 2008 16:23:12 +0000</pubDate>
		<dc:creator>Jonathan Bailey</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[Content-Theft]]></category>
		<category><![CDATA[Copyright-Infringement]]></category>
		<category><![CDATA[DMCA]]></category>
		<category><![CDATA[DMCA-notice]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[meta-tags]]></category>
		<category><![CDATA[notice-and-takedown]]></category>
		<category><![CDATA[Plagiarism]]></category>
		<category><![CDATA[quiz]]></category>
		<category><![CDATA[robots.txt]]></category>

		<guid isPermaLink="false">http://www.plagiarismtoday.com/?p=1881</guid>
		<description><![CDATA[After introducing the Pop Quiz content last week, I'm returning today with the answers to the questions along with a few useful links for controlling your content in Google. ]]></description>
			<content:encoded><![CDATA[<p><img src="http://www.plagiarismtoday.com/wp-content/uploads/2008/10/google-logo-1.png" alt="google-logo-1.png" border="0" width="271" height="106" align="left" class="picleft" />Last week, I threw a curveball into my usual posting mix, asking a pop quiz of seven questions dealing with the topic of controlling your content in Google.</p>
<p>The quiz seemed to generate some interest, just not the kind I had expected. A few people wrote me to ask if one of the questions was possible and, if so, how to do it. Still, the first poster, <a href="http://www.mmmeeja.com/">Andy Murdoch of MMMeeja</a>, a very neat looking Web design and development company, was the one to get the questions right.</p>
<p>So what were the correct answers and why? We&#8217;re going to take a look at the questions one at a time to see what the right answer was.<span id="more-1881"></span></p>
<h4>The Answers</h4>
<p><strong>What are three ways you can request your site be removed from Google search? (Note: This is your own site or a page on it, not an infringer)</strong></p>
<p>Not surprisingly, there were more than three correct answers. The most common ways involve using <a href="http://www.robotstxt.org/">robots.txt</a> or meta tags. But you can also use <a href="http://www.google.com/webmasters/tools/">Google Webmaster Tools</a> to remove the site or you can direct your server to respond to Google with an error code. </p>
<p>There are other ways, but those are the most common.</p>
<p><strong>What is the email address for the Google DMCA agent? (please only include the part before the @).</strong></p>
<p>The answer is DMCA-agent. This was actually something of a trick question. Google recently changed its DMCA information; the former address was amac. However, since I&#8217;ve mentioned the previous address on this site and it appears to still work, I would have accepted it as an answer.</p>
<p><strong>What, according to Matt Cutts, is the most effective way to report spam to Google?</strong></p>
<p>The answer was right here on Plagiarism Today. <a href="http://www.plagiarismtoday.com/2008/03/21/the-best-way-to-report-spam-to-google/">Matt Cutts has said that spam reports filed through Google Webmaster Tools are given much more weight</a> than those filed through other means. </p>
<p><strong>What is the name of the Google service that will email you search results as they are picked up by Google?</strong></p>
<p><a href="http://www.google.com/alerts">Google Alerts</a>. I&#8217;ve mentioned this service many times on this site and it remains one of their best anti-plagiarism/copyright infringement tools. </p>
<p><strong>What is the name of the Google spider, or rather, the name you need to refer to it as when you are trying to block it?</strong></p>
<p>Googlebot. This one was pretty simple actually and you can find it on any number of sites. </p>
<p><strong>What is the meta tag command to prevent Google from displaying a sample of your content in their results pages?</strong></p>
<p>This was the one that seemed to pique people&#8217;s interest. I got a couple of emails asking me if this was even possible. It&#8217;s actually pretty simple. The meta tag command &#8220;NOSNIPPET&#8221; will prevent Google from displaying the snippet in their results pages.</p>
<p>You can read about <a href="http://www.webmarketingnow.com/tips/meta-tags-google-meta-tags.html">this and other neat Google Meta Tags here</a>. </p>
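<p>For illustration, the tag takes the form shown in the sample string below, with <code>robots</code> (all crawlers) or <code>googlebot</code> (Google only) as the name. The short Python sketch, built on the standard <code>html.parser</code> module, checks whether a page carries the directive. The class and function names are invented for this example.</p>

```python
from html.parser import HTMLParser

# What the tag looks like inside a page's head section.
SAMPLE_TAG = '<meta name="robots" content="nosnippet">'


class RobotsMetaFinder(HTMLParser):
    """Collect the directives found in robots/googlebot meta tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        fields = dict(attrs)
        if (fields.get("name") or "").lower() in ("robots", "googlebot"):
            content = fields.get("content") or ""
            # Several directives may share one tag, separated by commas.
            for part in content.split(","):
                self.directives.add(part.strip().lower())


def has_nosnippet(html_text):
    """Report whether the page asks search engines not to show a snippet."""
    finder = RobotsMetaFinder()
    finder.feed(html_text)
    return "nosnippet" in finder.directives


print(has_nosnippet(SAMPLE_TAG))  # True
```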
<p><strong>What search command lets you see approximately how many pages Google has indexed on a site?</strong></p>
<p>site:yourdomain.com. It is important to note that this method is rather dubious in its effectiveness. The results tend to differ wildly from search to search but can still give you a rough idea of the range of indexed pages.</p>
<h4>Conclusions</h4>
<p>All in all, most of the questions in this quiz were pretty basic. Outside of the Matt Cutts and the NOSNIPPET questions, most of the questions could be answered in any number of places. </p>
<p>The goal was to get people thinking about how Google uses their content and what tools they are provided with to control that. </p>
<p>I hope everyone enjoyed this quiz and look forward to another one in the near future!</p>
<p><strong>Special Thanks:</strong> To <a href="http://rvb.roosterteeth.com/archive/episode.php?id=359">Red Vs. Blue</a> for the CTRL+ALT+BINGO Joke. Too good not to reuse&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.plagiarismtoday.com/2008/10/07/pop-quiz-1-google-answers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Picking a Dead Man&#8217;s Pocket</title>
		<link>http://www.plagiarismtoday.com/2007/07/19/picking-a-dead-mans-pocket/</link>
		<comments>http://www.plagiarismtoday.com/2007/07/19/picking-a-dead-mans-pocket/#comments</comments>
		<pubDate>Thu, 19 Jul 2007 16:23:39 +0000</pubDate>
		<dc:creator>Jonathan Bailey</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[DMCA]]></category>
		<category><![CDATA[Legal Issues]]></category>
		<category><![CDATA[Prevention]]></category>
		<category><![CDATA[archive]]></category>
		<category><![CDATA[caching]]></category>
		<category><![CDATA[Content-Theft]]></category>
		<category><![CDATA[Copyright]]></category>
		<category><![CDATA[Copyright-Infringement]]></category>
		<category><![CDATA[Internet-Archive]]></category>
		<category><![CDATA[meta-tags]]></category>
		<category><![CDATA[Plagiarism]]></category>
		<category><![CDATA[robots.txt]]></category>
		<category><![CDATA[RSS]]></category>
		<category><![CDATA[Scraping]]></category>
		<category><![CDATA[Spam]]></category>
		<category><![CDATA[spammers]]></category>
		<category><![CDATA[Splogs]]></category>

		<guid isPermaLink="false">http://www.plagiarismtoday.com/2007/07/19/picking-a-dead-mans-pocket/</guid>
		<description><![CDATA[If reading this site and enduring the onslaught of content theft, plagiarism and scraping is making you think about packing up and shutting down your site, you might want to think again. These days, even death does not put an end to content theft. It merely opens up new avenues for it. As a recent...]]></description>
			<content:encoded><![CDATA[<p>If reading this site and enduring the onslaught of content theft, plagiarism and scraping is making you think about packing up and shutting down your site, you might want to think again.</p>
<p>These days, even death does not put an end to content theft. It merely opens up new avenues for it.</p>
<p>As a recent article on <a href="http://www.bluehatseo.com/black-hole-seo-desert-scraping/">Blue Hat SEO pointed out</a>, nothing is ever really deleted from the Web. </p>
<p>Caching sites and archives hold on to your content long after the page has been removed and, as the article demonstrates, anything that is available can be scraped.</p>
<p>It is the equivalent of picking a dead man&#8217;s pocket, but it is a type of plagiarism that can and does happen. It is also a kind of plagiarism that <a href="http://www.nusuni.com/blog/2007/07/06/scraping-old-content-from-dead-sites-can-still-be-copyright-infringement-and-can-still-cause-seo-issues/#comment-10632">raises a whole slew of new questions</a> and concerns.</p>
<p><span id="more-547"></span><strong>How it is Done</strong></p>
<p>The process for scraping a dead site is surprisingly simple, involving only five steps.</p>
<ol>
<li>Visit an archiving site such as <a href="http://www.archive.org">The Internet Archive</a></li>
<li>Look up an old site that you know to be deceased</li>
<li>Find an old, but still relevant, article</li>
<li>Check to make sure that the article has not been posted elsewhere</li>
<li>Copy/paste it onto your site</li>
</ol>
<p>As the original article points out, there are ways to automate this process, such as using sitemaps, but the process clearly works best by hand. With that in mind, this method is unlikely to be used by professional spam bloggers, who need more content than this method could provide, but could be favored by human-powered spam sites that are trying to appear legit.</p>
<p>To those Webmasters, the unethical ones catering to humans and the search engine bots, this type of content theft can seem like a dream come true, especially when one weighs the advantages that come with it.</p>
<p><strong>The Advantages for Scrapers</strong></p>
<p>When looking at why a scraper or a plagiarist would find this kind of content appealing, consider the following:</p>
<ol>
<li>Since the content has been removed from the search engines, there is no possible duplicate content penalty and no original site to compete with in the search engine rankings</li>
<li>If the original site is shut down, most likely, the people charged with protecting the content are no longer looking for infringements. Thus, the odds of finding legal trouble are slim to none.</li>
<li>Finally, there is a wealth of dead information out there, ready to be copied. Though the Web is growing, sites are also dying every day, thus ensuring a steady stream flowing into a large, growing pool.</li>
</ol>
<p>Of course, taking advantage of such a system requires a certain amount of cleverness and work ethic, things that are in short supply for most that would steal content, but if one sees the advantages as outweighing the drawbacks, it would still be an appealing option.</p>
<p>However, given the wide range of legitimate, easily-accessed content available to scrape, it seems unlikely that most infringers would take this route. </p>
<p>Despite that, there is little doubt that at least some do scrape dead sites for at least part of their content and that fact raises some difficult questions that don&#8217;t come with easy answers.</p>
<p><strong>Questions to Ponder</strong></p>
<p>Obviously, given the current length of copyright law, any such scraping would likely be an infringement. Unless the work was placed under a Creative Commons license or donated to the public domain, the work is almost certainly still protected and the copying is illegal.</p>
<p>However, with the business interest in protecting those works gone, any copyright is likely to go unenforced. Most won&#8217;t care to check for any infringement and those who do will likely be unaware of the potential danger. Most feel that, when the site is removed, the risk of plagiarism disappears.</p>
<p>Worse still, the authors of the work, who often created the works as a work for hire and hold no copyright interest in it, would be the ones holding the greatest continued stake in the work. Though their employers have long since closed up shop, their names and reputations remain affixed to the works.</p>
<p>But even if the copyright holder does detect such post-mortem plagiarism and expresses an interest in defending those works, going through the motions could be difficult. Though the DMCA <a href="http://www.plagiarismtoday.com/2005/09/29/how-to-write-an-effective-dmca-notice/">does not state that URLs must be provided with a DMCA notice</a>, some hosts have that requirement and it is the most common way of identifying the original works. With the site down, that element could prove difficult.</p>
<p>Furthermore, since there would be no economic damages and it is unlikely that the work was registered, suing for said infringement would be almost financially impossible. In short, though stopping such plagiarism would be possible, it would be more difficult and less worth the time spent.</p>
<p>It may not be the perfect crime, but it is certainly pretty close. It&#8217;s very hard to envision a scraper paying for such an infringement, no matter how unethical or illegal it is.</p>
<p><strong>Prevention is the Key</strong></p>
<p>Since enforcing the copyrights of a dead site is, in many cases, impractical, the best approach is to look at preventing the infringement from happening in the first place. Doing that involves having a sound shut-down strategy that includes the following steps:</p>
<ol>
<li><strong>Handle all existing plagiarism cases:</strong> Deal with all ongoing plagiarism and content theft cases that you can. Ensure that the content is removed completely before beginning to take content offline. </li>
<li><strong>Block known archiving sites:</strong> Edit your robots.txt file to <a href="http://www.archive.org/about/exclude.php">exclude the Internet Archive</a> and other archiving services. All cached copies of your site should be removed within a few days. (Note: You can skip this step in favor of step three but it is worth being certain that these archiving sites drop your cached copies.)</li>
<li><strong>Block All Search Engines:</strong> Since search engine caches can remain active for some time after a site goes down, once you are certain the archive sites have removed the work, <a href="http://www.quickonlinetips.com/archives/2005/01/ways-to-prevent-search-engines-from-indexing-your-private-site/">edit the robots.txt file to exclude all search engines</a>. You can also <a href="http://www.seoconsultants.com/meta-tags/">use meta tags to prevent indexing</a>, or both to ensure that all spiders stop indexing your site. </li>
<li><strong>Move Content to a Hidden Location:</strong> Move a copy of your site to a hidden, but accessible, location. Consider making the site password protected to ensure that it cannot be indexed or visited by anyone you do not personally authorize.</li>
<li><strong>Remove the Original Site:</strong> Take down the site and close up shop, being certain to leave behind a means of contact in the event a problem is discovered later.</li>
</ol>
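<p>Steps two and three can be sketched as two successive robots.txt files. This assumes the Internet Archive&#8217;s crawler identifies itself as <code>ia_archiver</code>, the name its exclusion instructions have used; check the linked page before relying on it. The helper <code>allowed</code> is invented here and uses Python&#8217;s standard <code>urllib.robotparser</code> only to confirm what each file asks for.</p>

```python
from urllib.robotparser import RobotFileParser

# Step two: ask only the archive crawler to stay out (agent name is an assumption).
STEP_TWO = """\
User-agent: ia_archiver
Disallow: /
"""

# Step three: ask every crawler, search engines included, to stay out.
STEP_THREE = """\
User-agent: *
Disallow: /
"""


def allowed(robots_text, user_agent, url="http://www.example.com/"):
    """Report whether robots_text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(STEP_TWO, "ia_archiver"))    # False
print(allowed(STEP_TWO, "Googlebot"))      # True
print(allowed(STEP_THREE, "Googlebot"))    # False
```

<p>As with any robots.txt rule, these files only take effect once each crawler revisits the site and reads them, which is why the site has to stay up while you wait.</p>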
<p>Though the process seems lengthy and difficult, it can easily be done over a couple of days with most of the time spent waiting on the spiders to re-index your site. Once they do, the cached copies should be dropped almost immediately. </p>
<p>The key thing in all of it is to ensure that the cached copies of your work are removed BEFORE you take down the site. As long as you ensure that the major caching services, especially the <a href="http://www.archive.org">Internet Archive</a>, have removed your work before you shut down, you can greatly reduce the risk of post-mortem plagiarism.</p>
<p><strong>Conclusions</strong></p>
<p>The good news is that, though there is no way to tell how common this kind of content theft is, it almost certainly is very rare. In most cases, the meager rewards do not justify the effort required. Also, when it does happen, it will largely be limited to the larger, better-known sites that are now defunct as it requires at least some advanced knowledge of the site&#8217;s existence.</p>
<p>The bottom line is that the content on any Web site, even a closed one, still has value. This is especially true if the rightsholder is looking to start up another venture in the future or is pursuing off-line methods of distribution. The fact that the site is down does not mean it is acceptable to exploit the effort and expense of the original author.</p>
<p>Perhaps the strangest side effect of this is the damage it does to the classic scraper mantra &#8220;If you don&#8217;t want your content scraped, get off the Web&#8221;. If shutting down a site does not put a definite end to content theft, then dealing with plagiarism on a live site becomes an even more practical solution.</p>
<p>In the end, it seems that the only surefire way to avoid scraping is to not put your content on the Web in the first place. As it sits right now, once the content is on the Web, there is no way to erase it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.plagiarismtoday.com/2007/07/19/picking-a-dead-mans-pocket/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

