<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>spammers | Plagiarism Today</title>
	<atom:link href="http://www.plagiarismtoday.com/tag/spammers/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.plagiarismtoday.com</link>
	<description>Content Theft, Plagiarism, Copyright Infringement</description>
	<lastBuildDate>Mon, 13 Feb 2012 06:51:37 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Blogger CAPTCHA Cracked</title>
		<link>http://www.plagiarismtoday.com/2008/04/28/googles-blogspot-captcha-cracked/</link>
		<comments>http://www.plagiarismtoday.com/2008/04/28/googles-blogspot-captcha-cracked/#comments</comments>
		<pubDate>Mon, 28 Apr 2008 16:46:17 +0000</pubDate>
		<dc:creator>Jonathan Bailey</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[Prevention]]></category>
		<category><![CDATA[Blogger]]></category>
		<category><![CDATA[Blogspot]]></category>
		<category><![CDATA[Content-Theft]]></category>
		<category><![CDATA[Copyright-Infringement]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Plagiarism]]></category>
		<category><![CDATA[search spam]]></category>
		<category><![CDATA[Spam]]></category>
		<category><![CDATA[Spam-Blogs]]></category>
		<category><![CDATA[spammers]]></category>
		<category><![CDATA[Splogging]]></category>
		<category><![CDATA[Splogs]]></category>
		<category><![CDATA[web spam]]></category>

		<guid isPermaLink="false">http://www.plagiarismtoday.com/?p=989</guid>
		<description><![CDATA[Though it seemed as if Google was starting to make some headway into the spam blog problem on its Blogger service, the spammers seem to have turned the tide by cracking the CAPTCHA system and creating more accounts than ever before. ]]></description>
			<content:encoded><![CDATA[<p><img style=' float: left; padding: 4px; margin: 0 7px 2px 0;'  class="picleft alignleft size-medium wp-image-990" title="blogger-logo" src="http://www.plagiarismtoday.com/wp-content/uploads/2008/04/blogger-logo.jpg" alt="" width="250" height="81" />Google&#8217;s Blogger service, already one of the <a href="http://www.plagiarismtoday.com/2007/03/22/when-will-google-stand-up-to-spam/">largest sources of spam blogs on the Web</a>, is now being inundated with another wave of spammers following the <a title="Google CAPTCHA broken" href="http://www.thestandard.com/news/2008/04/25/spammers-ramp-siege-googles-blogger-bots">cracking of the Google CAPTCHA system</a>. This means that spammers can now fully automate the process of creating and setting up new Blogger spam blogs, making the process even faster and enabling the creation of more spam blogs than ever before.</p>
<p>Though these spam blogs will take many different approaches, many of them will inevitably use scraping as a means to fill their pages and appear more authentic to both Google the search engine and Google the host administrator.</p>
<p>Bloggers, especially those that frequently have spam-friendly keywords in their sites, should be aware of the likelihood of increased scraping on the Blogger service. Now would be an excellent time for sites that offer email subscriptions to <a title="Scraping Via Email" href="http://www.plagiarismtoday.com/2008/04/15/new-trend-scraping-via-email/">check for any @blogger.com accounts</a> and everyone to consider taking feed protection steps such as installing <a title="Antileech" href="http://redalt.com/Resources/Plugins/AntiLeech">Antileech</a>, creating a <a title="Feed Heater and Feed Footer" href="http://www.plagiarismtoday.com/2008/01/16/two-new-anti-scrpaing-wordpress-plugins/">feed header/footer</a> or using a <a title="Digital Fingerprint" href="http://www.plagiarismtoday.com/2006/10/05/update-digital-fingerprint-plugin-beta-2/">digital fingerprint</a>. </p>
<p>Sadly, though it recently seemed as if Google was on the <a title="Google Attacks Spam" href="http://www.plagiarismtoday.com/2007/06/26/is-blogger-on-the-offensive-against-spam/">offensive against spam</a>, it now appears as if the tables have turned.</p>
<p>While the new spam wave is still ramping up, now is the best chance for bloggers to be aware of the issue and be prepared to <a title="Blogger DMCA" href="http://www.google.com/blogger_dmca.html">take action as needed</a>. Hopefully, Google will fix this issue soon and the impact of the problem will be limited.</p>
<p>If not, then Blogspot could easily become even more of a spam wasteland than before, making it even more difficult for legitimate bloggers to get noticed on the service and for Webmasters everywhere to keep their content out of spammers&#8217; hands.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.plagiarismtoday.com/2008/04/28/googles-blogspot-captcha-cracked/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Picking a Dead Man&#8217;s Pocket</title>
		<link>http://www.plagiarismtoday.com/2007/07/19/picking-a-dead-mans-pocket/</link>
		<comments>http://www.plagiarismtoday.com/2007/07/19/picking-a-dead-mans-pocket/#comments</comments>
		<pubDate>Thu, 19 Jul 2007 16:23:39 +0000</pubDate>
		<dc:creator>Jonathan Bailey</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[DMCA]]></category>
		<category><![CDATA[Legal Issues]]></category>
		<category><![CDATA[Prevention]]></category>
		<category><![CDATA[archive]]></category>
		<category><![CDATA[caching]]></category>
		<category><![CDATA[Content-Theft]]></category>
		<category><![CDATA[Copyright]]></category>
		<category><![CDATA[Copyright-Infringement]]></category>
		<category><![CDATA[Internet-Archive]]></category>
		<category><![CDATA[meta-tags]]></category>
		<category><![CDATA[Plagiarism]]></category>
		<category><![CDATA[robots.txt]]></category>
		<category><![CDATA[RSS]]></category>
		<category><![CDATA[Scraping]]></category>
		<category><![CDATA[Spam]]></category>
		<category><![CDATA[spammers]]></category>
		<category><![CDATA[Splogs]]></category>

		<guid isPermaLink="false">http://www.plagiarismtoday.com/2007/07/19/picking-a-dead-mans-pocket/</guid>
		<description><![CDATA[If reading this site and enduring the onslaught of content theft, plagiarism and scraping is making you think about packing up and shutting down your site, you might want to think again. These days, even death does not put an end to content theft. It merely opens up new avenues for it. As a recent...]]></description>
			<content:encoded><![CDATA[<p>If reading this site and enduring the onslaught of content theft, plagiarism and scraping is making you think about packing up and shutting down your site, you might want to think again.</p>
<p>These days, even death does not put an end to content theft. It merely opens up new avenues for it.</p>
<p>As a recent article on <a href="http://www.bluehatseo.com/black-hole-seo-desert-scraping/">Blue Hat SEO pointed out</a>, nothing is ever really deleted from the Web. </p>
<p>Caching sites and archives hold on to your content long after the page has been removed and, as the article demonstrates, anything that is available can be scraped.</p>
<p>It is the equivalent of picking a dead man&#8217;s pocket, but it is a type of plagiarism that can and does happen. It is also a kind of plagiarism that <a href="http://www.nusuni.com/blog/2007/07/06/scraping-old-content-from-dead-sites-can-still-be-copyright-infringement-and-can-still-cause-seo-issues/#comment-10632">raises a whole slew of new questions</a> and concerns.</p>
<p><span id="more-547"></span><strong>How it is Done</strong></p>
<p>The process for scraping a dead site is surprisingly simple, involving only five steps.</p>
<ol>
<li>Visit an archiving site such as <a href="http://www.archive.org">The Internet Archive</a></li>
<li>Look up an old site that you know to be deceased</li>
<li>Find an old, but still relevant, article</li>
<li>Check to make sure that the article has not been posted elsewhere</li>
<li>Copy/paste it onto your site</li>
</ol>
<p>As the original article points out, there are ways to automate this process, such as using sitemaps, but the process clearly works best by hand. With that in mind, this method is unlikely to be used by professional spam bloggers, who need more content than this method could provide, but could be favored by human-powered spam sites that are trying to appear legit.</p>
<p>To those Webmasters, the unethical ones catering to humans and the search engine bots, this type of content theft can seem like a dream come true, especially when one weighs the advantages that come with it.</p>
<p><strong>The Advantages for Scrapers</strong></p>
<p>When looking at why a scraper or a plagiarist would find this kind of content appealing, consider the following:</p>
<ol>
<li>Since the content has been removed from the search engines, there is no possible duplicate content penalty and no original site to compete with in the search engine rankings</li>
<li>If the original site is shut down, most likely, the people charged with protecting the content are no longer looking for infringements. Thus, the odds of finding legal trouble are slim to none.</li>
<li>Finally, there is a wealth of dead information out there, ready to be copied. Though the Web is growing, sites are also dying every day, thus ensuring a steady stream flowing into a large, growing pool.</li>
</ol>
<p>Of course, taking advantage of such a system requires a certain amount of cleverness and work ethic, things that are in short supply for most that would steal content, but if one sees the advantages as outweighing the drawbacks, it would still be an appealing option.</p>
<p>However, given the wide range of legitimate, easily-accessed content available to scrape, it seems unlikely that most infringers would take this route. </p>
<p>Despite that, there is little doubt that at least some do scrape dead sites for at least part of their content and that fact raises some difficult questions that don&#8217;t come with easy answers.</p>
<p><strong>Questions to Ponder</strong></p>
<p>Obviously, given the current length of copyright law, any such scraping would likely be an infringement. Unless the work was placed under a Creative Commons license or donated to the public domain, the work is almost certainly still protected and the copying is illegal.</p>
<p>However, with the business interest in protecting those works gone, any copyright is likely to go unenforced. Most won&#8217;t care to check for any infringement and those who do will likely be unaware of the potential danger. Most feel that, when the site is removed, the risk of plagiarism disappears.</p>
<p>Worse still, the authors of the work, who often created it as a work for hire and hold no copyright interest in it, would be the ones holding the greatest continued stake in the work. Though their employers have long since closed up shop, their names and reputations remain affixed to the works.</p>
<p>But even if the copyright holder does detect such post-mortem plagiarism and expresses an interest in defending those works, going through the motions could be difficult. Though the DMCA <a href="http://www.plagiarismtoday.com/2005/09/29/how-to-write-an-effective-dmca-notice/">does not state that URLs must be provided with a DMCA notice</a>, some hosts have that requirement and it is the most common way of identifying the original works. With the site down, that element could prove difficult.</p>
<p>Furthermore, since there would be no economic damages and it is unlikely that the work was registered, suing over the infringement would be close to financially impossible. In short, though stopping such plagiarism would be possible, it would be more difficult and less worth the time spent.</p>
<p>It may not be the perfect crime, but it is certainly pretty close. It&#8217;s very hard to envision a scraper paying for such an infringement, no matter how unethical or illegal it is.</p>
<p><strong>Prevention is the Key</strong></p>
<p>Since enforcing the copyrights of a dead site is, in many cases, impractical, the best approach is to look at preventing the infringement from happening in the first place. Doing that involves having a sound shut-down strategy that includes the following steps:</p>
<ol>
<li><strong>Handle all existing plagiarism cases:</strong> Deal with all ongoing plagiarism and content theft cases that you can. Ensure that the content is removed completely before beginning to take content offline. </li>
<li><strong>Block known archiving sites:</strong> Edit your robots.txt file to <a href="http://www.archive.org/about/exclude.php">exclude the Internet Archive</a> and other archiving services. All cached copies of your site should be removed within a few days. (Note: You can skip this step in favor of step three but it is worth being certain that these archiving sites drop your cached copies.)</li>
<li><strong>Block All Search Engines:</strong> Since search engine caches can remain active for some time after a site goes down, once you are certain the archive sites have removed the work, <a href="http://www.quickonlinetips.com/archives/2005/01/ways-to-prevent-search-engines-from-indexing-your-private-site/">edit the robots.txt file to exclude all search engines</a>. You can also <a href="http://www.seoconsultants.com/meta-tags/">use meta tags to prevent indexing</a>, or both to ensure that all spiders stop indexing your site. </li>
<li><strong>Move Content to a Hidden Location:</strong> Move a copy of your site to a hidden, but accessible, location. Consider making the site password protected to ensure that it cannot be indexed or visited by anyone you do not personally authorize.</li>
<li><strong>Remove the Original Site:</strong> Take down the site and close up shop, being certain to leave behind a means of contact in the event a problem is discovered later.</li>
</ol>
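<p>The robots.txt changes in steps two and three can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the example.com URL are hypothetical, and the <code>ia_archiver</code> user-agent is assumed to be the one the Internet Archive's exclusion page asks for, so check that page before relying on it. The second function uses Python's standard <code>urllib.robotparser</code> module to confirm what a compliant crawler would conclude from the file.</p>
<pre>
```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(block_all=False):
    """Return robots.txt text for the shut-down steps (hypothetical helper)."""
    if block_all:
        # Step three: ask every compliant crawler to stay out.
        agents = ["*"]
    else:
        # Step two: target the archive crawler only.
        # Assumption: "ia_archiver" is the user-agent the archive honors.
        agents = ["ia_archiver"]
    lines = []
    for agent in agents:
        lines += ["User-agent: " + agent, "Disallow: /", ""]
    return "\n".join(lines)


def is_allowed(robots_txt, agent, url):
    """Report whether a compliant crawler named `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(build_robots_txt())
    print(build_robots_txt(block_all=True))
```
</pre>
<p>Keep in mind that robots.txt is only a request. Well-behaved archives and search engines honor it, but scrapers are free to ignore it, which is why step four still matters.</p>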
<p>Though the process seems lengthy and difficult, it can easily be done over a couple of days with most of the time spent waiting on the spiders to re-index your site. Once they do, the cached copies should be dropped almost immediately. </p>
<p>The key thing in all of it is to ensure that the cached copies of your work are removed BEFORE you take down the site. As long as you ensure that the major caching services, especially the <a href="http://www.archive.org">Internet Archive</a>, have removed your work before you shut down, you can greatly reduce the risk of post-mortem plagiarism.</p>
<p><strong>Conclusions</strong></p>
<p>The good news is that, though there is no way to tell how common this kind of content theft is, it almost certainly is very rare. In most cases, the meager rewards do not justify the effort required. Also, when it does happen, it will largely be limited to the larger, better-known sites that are now defunct, as it requires at least some advance knowledge of the site&#8217;s existence.</p>
<p>The bottom line is that the content on any Web site, even a closed one, still has value. This is especially true if the rightsholder is looking to start up another venture in the future or is pursuing off-line methods of distribution. The fact that the site is down does not mean it is acceptable to exploit the effort and expense of the original author.</p>
<p>Perhaps the strangest side effect of this is the damage it does to the classic scraper mantra &#8220;If you don&#8217;t want your content scraped, get off the Web&#8221;. If shutting down a site does not put a definite end to content theft, then dealing with plagiarism on a live site becomes an even more practical solution.</p>
<p>In the end, it seems that the only surefire way to avoid scraping is to not put your content on the Web in the first place. As it sits right now, once the content is on the Web, there is no way to erase it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.plagiarismtoday.com/2007/07/19/picking-a-dead-mans-pocket/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>The Most Common Mistake in Plagiarism Fighting</title>
		<link>http://www.plagiarismtoday.com/2007/07/10/the-most-common-mistake-in-plagiarism-fighting/</link>
		<comments>http://www.plagiarismtoday.com/2007/07/10/the-most-common-mistake-in-plagiarism-fighting/#comments</comments>
		<pubDate>Tue, 10 Jul 2007 19:53:58 +0000</pubDate>
		<dc:creator>Jonathan Bailey</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[DMCA]]></category>
		<category><![CDATA[Legal Issues]]></category>
		<category><![CDATA[Content-Theft]]></category>
		<category><![CDATA[Copyright]]></category>
		<category><![CDATA[Copyright-Infringement]]></category>
		<category><![CDATA[Host]]></category>
		<category><![CDATA[Law]]></category>
		<category><![CDATA[Plagiarism]]></category>
		<category><![CDATA[RSS]]></category>
		<category><![CDATA[Scraping]]></category>
		<category><![CDATA[Spam]]></category>
		<category><![CDATA[spammers]]></category>
		<category><![CDATA[Splogs]]></category>
		<category><![CDATA[Web-Host]]></category>
		<category><![CDATA[Web-Hosting]]></category>

		<guid isPermaLink="false">http://www.plagiarismtoday.com/2007/07/10/the-most-common-mistake-in-plagiarism-fighting/</guid>
		<description><![CDATA[As the issue of plagiarism and content theft draws more and more attention on the web, in particular among bloggers, several Webmasters are posting their experiences with content theft and some intrepid writers are producing their own guides for fighting content theft on the Web. Though the attention to this issue is welcome, many of...]]></description>
			<content:encoded><![CDATA[<p>As the issue of plagiarism and content theft draws more and more attention on the web, in particular among bloggers, several Webmasters are posting their experiences with content theft and some intrepid writers are producing their own guides for fighting content theft on the Web.</p>
<p>Though the attention to this issue is welcome, many of these guides contain false, misleading or incomplete information. Though they are produced by smart, well-intentioned people, their errors can lessen the effectiveness of their strategies and, in some cases, expose the person following it to legal danger.</p>
<p>For example, one guide I encountered a year ago encouraged people to take their grievances public immediately and post them to a special forum. Not only is this time-consuming and unlikely to succeed in many cases, but it opens up the person doing the posting to a libel suit if their information is wrong.</p>
<p>However, such dangerous mistakes are relatively rare and are usually limited to small and obscure sites. Instead, the most common mistake made when crafting an anti-plagiarism strategy is something much simpler: forgetting about the host.</p>
<p>As strange as it may sound, the most common omission in many of these guides is the most effective tactic of all: getting the site shut down.</p>
<p><span id="more-534"></span><strong>Skipping a Step</strong></p>
<p>It seems that most guides on plagiarism fighting are pretty good at telling you ways to detect the infringement and to contact the plagiarist. Many provide stock cease and desist letters to send to the infringing Webmaster and advice on how to deal with different kinds of plagiarists.</p>
<p>However, should that step fail, a majority of these guides will then offer advice on how to get the content removed from the search engines or get the site&#8217;s advertising cut.</p>
<p>Though targeting advertisers can be a very effective way of dealing with profit-motivated plagiarism, such as with scrapers, neither that nor targeting search engines is as useful for deflecting the potential problems that come with being plagiarized as getting the site or the content removed.</p>
<p>Interestingly enough, many guides will include DMCA contact information and stock letters for contacting the search engine, but will completely omit any information about sending such a letter to the host.</p>
<p>Whenever I see such an omission, I comment on it and, in most cases, it is corrected fairly quickly. I have only seen a few such guides remain for a long period of time without this critical information.</p>
<p>Still, the frequency of this mistake has made me wonder why so many people overlook it. However, it didn&#8217;t take me long to think of a few potential answers as to why.</p>
<p><strong>The Hardest Button to Button</strong></p>
<p>The problem with filing a DMCA notice with a host is that it can be a very daunting challenge. Even if you have the template handy, you have to first know how to <a href="http://www.plagiarismtoday.com/stopping-internet-plagiarism/3-finding-the-host/">determine who the host is</a>, then, if they are in the U.S., locate their <a href="http://www.plagiarismtoday.com/dmca-contact-information/">DMCA contact information</a> and then contact via the means they specify.</p>
<p>That, in turn, requires a level of research many people are not comfortable with. If you are unfamiliar with networking tools, ill at ease reading through terms of use or only have limited knowledge about how the Internet works, sending a DMCA notice to a host can be a very daunting challenge.</p>
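<p>For readers who have never touched a networking tool, the first part of that research can be sketched in Python. This is a minimal illustration, assuming the common approach of resolving the infringing site's domain to an IP address and then looking up who owns that address. The function names and the example.com URL are hypothetical, and the ownership lookup itself is left to whichever WHOIS tool you prefer.</p>
<pre>
```python
import socket
from urllib.parse import urlparse


def hostname_from_url(url):
    """Pull the bare hostname out of the address of an infringing page."""
    return urlparse(url).hostname


def resolve_ip(hostname):
    """Ask DNS which IP address currently serves the hostname.

    This needs a working network connection and raises socket.gaierror
    if the name does not resolve.
    """
    return socket.gethostbyname(hostname)


if __name__ == "__main__":
    # Hypothetical address; substitute the page that copied your work.
    name = hostname_from_url("http://www.example.com/stolen-post/")
    print(name)
    print(resolve_ip(name))
```
</pre>
<p>The IP address is only a starting point. A WHOIS lookup on it usually names the network that owns the address, and that company, rather than the domain registrar, is normally the host whose DMCA contact you need.</p>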
<p>Sending notices to Google and the other search engines, by comparison, is very easy. If you have the template in hand, there is only one page you need to know for each search engine. It is pretty trivial, from there, to send out the notices without doing any research and not wading into any uncomfortable waters.</p>
<p>However, this is dangerous for several reasons:</p>
<ol>
<li><strong> Doesn&#8217;t remove the content:</strong> Though it won&#8217;t turn up in the search engines you send the notice to, the plagiarized copies are still available on the Web and other search engines as well as human visitors can still access it.</li>
<li><strong>Turns the search engines into the copyright police:</strong> This concentrates all of the responsibility for policing copyright into three or four search engines. This was not the goal of the DMCA and it gives those companies too much power and responsibility in this matter. A change in policy of just one search engine could, potentially, have drastic implications on the Web.</li>
<li><strong>Can harm innocent bystanders:</strong> Search engine DMCA bans work differently from site to site but, in some cases, it is possible that more pages than intended can be banned, including pages written by other people.</li>
</ol>
<p>While there is definitely a place and a time for using search engine DMCA bans, immediately following a cease and desist is not it. Typically, I only turn to search engine bans when everything else has failed and I am prepared to give up.</p>
<p>In my mind, it is a way to do something when it seems nothing can be done at all.</p>
<p><strong>Correcting the Problem</strong></p>
<p>Since fewer people are comfortable with sending DMCA notices to hosts, fewer people use them. Since fewer people use them, fewer people write about them and that means that fewer people know about their existence.</p>
<p>This is a brutal cycle where more and more people get incomplete information. Not only does this lead people to use less effective tactics, but it also leads to mistakes down the road when they attempt to contact the host, often resulting in <a href="http://www.plagiarismtoday.com/2005/11/23/study-chronicles-dmca-abuses/">false or incomplete notices</a>.</p>
<p>The key, then, becomes to make sure that more people are aware of this method of dealing with plagiarism and push them to take advantage of it. It also means working to ensure that they have the tools available to file a proper notice and send it to the correct person.</p>
<p>If we can do that, along with providing <a href="http://www.plagiarismtoday.com/your-copyrights-online/1-what-is-a-copyright/">basic copyright information</a>, we can go a long way to reducing the copyright drama that exists on the Web.</p>
<p><strong>Conclusions</strong></p>
<p>The vast majority of people who post anti-plagiarism guides are good, well-intentioned people that are trying to help others. Unfortunately, they are not always right and sometimes that advice can lead people astray.</p>
<p>It is important, when researching an anti-plagiarism strategy, not to just read one guide, but two or three. Don&#8217;t take any one person&#8217;s word, including mine, as gospel. Seek out other opinions, views and strategies. There is a constant dialog going on and, though I try to report on it, the Web is a big place and I don&#8217;t see absolutely everything.</p>
<p>Build your own strategy based upon your needs, time constraints and skills. When appropriate, experiment. If you learn something that works or see something new, share it with others and <a href="http://www.plagiarismtoday.com/contact-pt/">drop me a note as well</a>.</p>
<p>I am always on the lookout for new techniques and strategies in prevention, detection and cessation that can help me and other Webmasters. Input and feedback are always appreciated.</p>
<p><em>Note: I have not linked any of the guides that inspired me to create this story. My goal with this is not to call anyone out or embarrass anyone. These are complicated issues and mistakes are understandable.  I want to encourage others to create more guides, not shame people that make simple mistakes. Furthermore, nearly all of the guides that I&#8217;ve seen with this error have since been fixed. </em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.plagiarismtoday.com/2007/07/10/the-most-common-mistake-in-plagiarism-fighting/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Scraping Starts from the Very First Post</title>
		<link>http://www.plagiarismtoday.com/2007/05/22/scraping-starts-from-the-very-first-post/</link>
		<comments>http://www.plagiarismtoday.com/2007/05/22/scraping-starts-from-the-very-first-post/#comments</comments>
		<pubDate>Tue, 22 May 2007 17:32:24 +0000</pubDate>
		<dc:creator>Jonathan Bailey</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[Legal Issues]]></category>
		<category><![CDATA[News]]></category>
		<category><![CDATA[Content-Theft]]></category>
		<category><![CDATA[Copyright]]></category>
		<category><![CDATA[Copyright-Infringement]]></category>
		<category><![CDATA[Plagiarism]]></category>
		<category><![CDATA[RSS]]></category>
		<category><![CDATA[Scraping]]></category>
		<category><![CDATA[Spam]]></category>
		<category><![CDATA[spammers]]></category>
		<category><![CDATA[Splogs]]></category>

		<guid isPermaLink="false">http://www.plagiarismtoday.com/2007/05/22/scraping-starts-from-the-very-first-post/</guid>
		<description><![CDATA[Rachel Radison is a New Orleans-based mortgage broker. After seeing the difficult home-buying climate in the city following Hurricane Katrina, she decided that she could help would-be buyers with her know how. Radison, having heard about these new Web sites called &#8220;blogs&#8221; decided to create one herself and obtained a WordPress.com account. However, her ignorance...]]></description>
			<content:encoded><![CDATA[<p>Rachel Radison is a New Orleans-based mortgage broker. After seeing the difficult home-buying climate in the city following Hurricane Katrina, she decided that she could help would-be buyers with her know how.</p>
<p>Radison, having heard about these new Web sites called &#8220;blogs&#8221;, decided to <a href="http://rrkatrina.wordpress.com/">create one herself</a> and obtained a WordPress.com account. However, her ignorance about blogs quickly caught up with her. Shortly after her first post, she checked her feed statistics and found that a whopping eleven people had subscribed to her feed.</p>
<p>She then quickly learned they weren&#8217;t interested in her feed at all, just her content.</p>
<p>Fortunately, Rachel Radison does not exist. She is a figment of my own imagination. I created both her and her site as an experiment to see both how common scraping is and how long it would take for scrapers to find a blog.</p>
<p>The answer surprised even me.</p>
<p><span id="more-494"></span><strong>Background</strong></p>
<p>The idea for the experiment came from an article on A Daily Rant. There, the owner took an ongoing WordPress.com blog and shut it down, leaving only a &#8220;this blog has been moved&#8221; post to let subscribers know. He then tracked the subscribers to the feed and found that, even after most subscribers had moved over, eighteen still remained.</p>
<p>The experiment was interesting but flawed. Many humans change RSS readers and often forget to unsubscribe from old feeds everywhere they&#8217;ve been. For example, I am certain there are many dead feeds in my old Rojo account.  It is entirely possible that, of the readers they had prior to the shut down, eighteen were simply old, but legitimate, RSS readers continuing to check the dead feed.</p>
<p>However, the idea seemed valid enough: take a dead feed and monitor the traffic on it. If you can set it up so that no humans should be subscribed to the feed, you can be reasonably sure that all traffic to the feed is from bots.</p>
<p>So, that is exactly what I did.</p>
<p><strong>The Experiment</strong></p>
<p>Using an old Gmail account, I created a new WordPress.com blog. I then gave the blog a theme and a name, &#8220;Rachel Radison&#8217;s Mortgage Blog&#8221;. I specifically chose mortgages as the subject matter because &#8220;mortgage&#8221; is a spam-friendly keyword and it is a topic I have at least a little knowledge about after buying my new home.</p>
<p>In order to avoid being caught in <a href="http://www.plagiarismtoday.com/2007/04/09/why-wordpresscom-is-virtually-spam-free/">WordPress.com&#8217;s spam filters</a>, I decided to write the posts myself, using my extremely limited knowledge of the subject and a large amount of fluff. After a few moments of faking my way through the first post (being sure to add a footer disclaiming the site as rubbish), I made sure that the blog was going to ping all of the usual notification services and published it.</p>
<p>I then returned the next day to do another post but stopped off to check the feed stats. What I saw is below:</p>
<p style="text-align: center"><a href="http://www.plagiarismtoday.com/wp-content/uploads/2007/05/day12.png" title="day12.png"><img src="http://www.plagiarismtoday.com/wp-content/uploads/2007/05/day12.thumbnail.png" alt="day12.png" /></a></p>
<p>Even I was stunned by this. I had expected the feed to be scraped. But to have eleven &#8220;subscribers&#8221; after less than 24 hours was stunning.</p>
<p>Some of these subscribers could be easily explained. Technorati and Google both picked up the feed almost immediately. Likewise, WordPress seems to have its own crawler. However, no others have picked it up as of this writing and that leaves at least eight &#8220;subscribers&#8221; unaccounted for.</p>
<p><strong>But Wait, It Gets Worse</strong></p>
<p>I posted again on the seventeenth but, due to difficulties with my move, was very late in posting on the eighteenth. However, before I posted, I checked the feed stats again. What I saw is below:</p>
<p style="text-align: center"><a href="http://www.plagiarismtoday.com/wp-content/uploads/2007/05/day3.png" title="day3.png"><img src="http://www.plagiarismtoday.com/wp-content/uploads/2007/05/day3.thumbnail.png" alt="day3.png" /></a></p>
<p>The second day had seen a whopping sixteen subscribers, though no new search engines had picked up the feed. Even more strange, day three had dropped to only three subscribers,  marking an over 80% drop in subscribers.</p>
<p>However, shortly after I posted the third, and final, article, the subscriber count more than doubled, reaching eight. Though still a marked drop from the day before, it showed that the feed subscribers were prompted by my posting and not by the creation of the blog.</p>
<p>To test this, I then let the blog lapse for a few days. Very quickly, the feed subscriber count completely flat-lined, reaching zero.</p>
<p style="text-align: center"><a href="http://www.plagiarismtoday.com/wp-content/uploads/2007/05/day7.png" title="day7.png"><img src="http://www.plagiarismtoday.com/wp-content/uploads/2007/05/day7.thumbnail.png" alt="day7.png" /></a></p>
<p>This does not follow a &#8220;human&#8221; pattern. Feed counts rise and fall, as anyone at <a href="http://www.feedburner.com">FeedBurner</a> will tell you, but they do not follow this pattern. This is, almost certainly, the work of bots, both good and bad.</p>
<p>The outcome is pretty damning. It is obvious that there are at least some scrapers waiting for your site from the very first post and that being an unknown blogger is no protection against RSS abuse.</p>
<p><strong>Problems with the Study</strong></p>
<p>This isn&#8217;t to say that there aren&#8217;t problems with the study. There are several.</p>
<p>First and foremost, the study, by itself, means nothing. It is just one site on one service and on one topic. A more complete study would try more blogs on a variety of topics and services.</p>
<p>But a bigger problem is that I cannot account for all of the subscribers of the feed. I have done several searches for the scraper sites but have had no luck in locating them. Odds are, it is simply too early for them to have been picked up by the search engines. Even the original site is only in Technorati and Google as of this writing.</p>
<p>Also, with most search engines running spam filters, it is very likely that they would catch the scraped blogs before they were indexed. In fact, I have a feeling that several have even determined that the original blog is spam, which it technically is, and refused to index it as well, which would explain why Icerocket, Sphere, Yahoo! and others have not added the original either.</p>
<p>But even if we consider all of the legitimate blog search engines that would likely be looking at the feed, they do not account for all of the subscribers.</p>
<p>That is further supported by the fact that WordPress could not identify most of the readers of the feed, and most of those it did identify were deemed &#8220;Web browsers&#8221;.</p>
<p>Also, as you can see in the image below, traffic to the site itself never came anywhere near traffic to the feed (save today, when I&#8217;ve been visiting the site). Most search engines, including Technorati, also visit the site when indexing the feed to ensure they get the full post (it is a way to guard against partial feeds limiting blog search engines).</p>
<p style="text-align: center"><a href="http://www.plagiarismtoday.com/wp-content/uploads/2007/05/day7visits.png" title="day7visits.png"><img src="http://www.plagiarismtoday.com/wp-content/uploads/2007/05/day7visits.thumbnail.png" alt="day7visits.png" /></a></p>
<p>If the uses of the feed were legitimate, then traffic to the site would, initially, either meet or exceed the traffic to the feed. Because the blog is too new to have long-term subscribers who might not visit the site every day, the fact that over a dozen &#8220;subscribers&#8221; accessed the feed without accessing the site is very suspicious.</p>
<p><strong>Conclusions</strong></p>
<p>Though it is hard to draw any solid conclusions from this study, there are at least three things that are obvious:</p>
<ol>
<li>Suspicious use of a feed begins, literally, with the first post.</li>
<li>Being an unknown blogger is no defense against scraping.</li>
<li>Spammers are basing much of their scraping on the notification services most blogs ping. If a pinged post has a keyword they are targeting, it seems that they then visit the feed to grab the content.</li>
</ol>
<p>What isn&#8217;t clear at this time, and likely won&#8217;t be until search engines update or drop their spam filtering, is how many of these suspicious visitors were truly scrapers. Almost certainly some were, but some were also likely pinging services and search engines.</p>
<p>But if simple math is a clue and we believe that most legitimate services would also look at the site, it seems that the vast majority have less than honest intentions.</p>
<p>The bottom line is that something rotten is going on; it is just a matter of how rotten it is.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.plagiarismtoday.com/2007/05/22/scraping-starts-from-the-very-first-post/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
		</item>
	</channel>
</rss>

