<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>google blog search | Plagiarism Today</title>
	<atom:link href="http://www.plagiarismtoday.com/tag/google-blog-search/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.plagiarismtoday.com</link>
	<description>Content Theft, Plagiarism, Copyright Infringement</description>
	<lastBuildDate>Mon, 13 Feb 2012 06:51:37 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>3 Count: Bad Fortuny</title>
		<link>http://www.plagiarismtoday.com/2009/04/20/3-count-bad-fortuny/</link>
		<comments>http://www.plagiarismtoday.com/2009/04/20/3-count-bad-fortuny/#comments</comments>
		<pubDate>Mon, 20 Apr 2009 14:38:21 +0000</pubDate>
		<dc:creator>Jonathan Bailey</dc:creator>
				<category><![CDATA[Copyright News]]></category>
		<category><![CDATA[authors guild]]></category>
		<category><![CDATA[cnn]]></category>
		<category><![CDATA[Content-Theft]]></category>
		<category><![CDATA[Copyright-Infringement]]></category>
		<category><![CDATA[craigslist]]></category>
		<category><![CDATA[DMCA]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[google blog search]]></category>
		<category><![CDATA[Internet-Archive]]></category>
		<category><![CDATA[jason fortuny]]></category>
		<category><![CDATA[Plagiarism]]></category>
		<category><![CDATA[takedown]]></category>
		<category><![CDATA[YouTube]]></category>

		<guid isPermaLink="false">http://www.plagiarismtoday.com/?p=3265</guid>
		<description><![CDATA[This is a daily column on Plagiarism Today where the site brings you three of the day's biggest, most important copyright and plagiarism news links. If you want to offer your feedback on the column, use the contact form or just follow me on Twitter at @plagiarismtoday. 1: Internet Archive wants book copyright indemnity like Google...]]></description>
			<content:encoded><![CDATA[<p><em>This is a daily column on Plagiarism Today where the site brings you three of the day&#8217;s biggest, most important copyright and plagiarism news links. If you want to offer your feedback on the column, use the contact form or just follow me on Twitter at <a href="http://twitter.com/plagiarismtoday">@plagiarismtoday</a>.</em></p>
<h4>1:<br />
<a href="http://arstechnica.com/tech-policy/news/2009/04/internet-archive-wants-book-copyright-indemnity-like-google.ars">Internet Archive wants book copyright indemnity like Google</a></h4>
<p>The Internet Archive, famous for the Wayback Machine, <a href="http://web.archive.org/web/20051001001837/http://plagiarismtoday.com/">which displays old versions of Web sites</a>, is hoping that it can receive some of the same protections Google has with its recent Google Book Search settlement.</p>
<p>Google was sued in 2005 over its book scanning project by the Authors Guild. The two recently reached a settlement where Google paid out over $125 million for the rights to continue scanning and displaying out-of-print books. The settlement also protects Google in cases of orphan works, works where the author is unknown, should the author come forward later.</p>
<p>It is the orphan works issue that has the Internet Archive worried. It has written a letter to a federal judge asking to intervene in the Google case, saying that the lack of such protection would put it at a disadvantage.</p>
<h4>2: <a href="http://news.slashdot.org/news/09/04/18/140214.shtml">$74k Judgment Against Craigslist Prankster </a></h4>
<p>Next up today, Jason Fortuny, the famous Craigslist prankster, has been ordered to pay a total of $74,000 to one of his victims. This includes damages for copyright infringement, damages for invasion of privacy as well as court costs and attorney fees.</p>
<p>Fortuny posted personal ads on Craigslist in various cities pretending to be a woman. As men responded to the ads, he posted their responses, including images, on a site he had created.</p>
<p>Unsurprisingly, he was sued for this behavior and, with the default judgment being filed, it appears that this case is over. </p>
<h4>3: <a href="http://copyrightsandcampaigns.blogspot.com/2009/04/cnn-makes-copyright-claim-on-video.html">CNN makes copyright claim on video critical of reporter&#8217;s &#8216;Tea Party&#8217; interviews</a></h4>
<p>Finally today, CNN seems to have become the latest to step into the &#8220;takedown to block unpleasant speech&#8221; mess on YouTube, taking down a clip of a reporter being combative with tea party protesters.</p>
<p>The clip had become very popular with conservative bloggers, as many felt it showed CNN&#8217;s political bias. Though the clip was pulled, it has been re-posted in several other places on YouTube as well as on other video sharing sites.</p>
<h4>Suggestions</h4>
<p>That&#8217;s it for the three count today. We&#8217;ll be back tomorrow with three more copyright links. If you have a link that you want to suggest for the column or have any proposals to make it better, feel free to leave a comment or send me an email. I hope to hear from you.</p>
<h4>Want the Full Story?</h4>
<p>Tune in <a href="http://www.talkshoe.com/tc/22590">every Saturday morning for the live recording of the Copyright 2.0 Show</a> or wait and get the edited version <a href="http://www.plagiarismtoday.com/category/podcast/">Monday morning right here on Plagiarism Today</a>. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.plagiarismtoday.com/2009/04/20/3-count-bad-fortuny/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Pop Quiz 1: Know Your Google</title>
		<link>http://www.plagiarismtoday.com/2008/10/01/pop-quiz-1-know-your-google/</link>
		<comments>http://www.plagiarismtoday.com/2008/10/01/pop-quiz-1-know-your-google/#comments</comments>
		<pubDate>Wed, 01 Oct 2008 15:31:56 +0000</pubDate>
		<dc:creator>Jonathan Bailey</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[Content-Theft]]></category>
		<category><![CDATA[Copyright-Infringement]]></category>
		<category><![CDATA[game]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[google blog search]]></category>
		<category><![CDATA[Plagiarism]]></category>
		<category><![CDATA[quiz]]></category>

		<guid isPermaLink="false">http://www.plagiarismtoday.com/?p=1841</guid>
		<description><![CDATA[For those who enjoy a good quiz, today's post is a short seven-question quiz dealing with your content, Google and keeping both where you want them. ]]></description>
			<content:encoded><![CDATA[<table align="left" cellspacing=15>
<tr>
<td><a href="http://www.flickr.com/photos/98274023@N00/2417001179/" title="study." target="_blank"><img src="http://farm3.static.flickr.com/2168/2417001179_d5ce2d9362_m.jpg" alt="study." border="0" /></a><br /><small><a href="http://creativecommons.org/licenses/by-nd/2.0/" title="Attribution-NoDerivs License" target="_blank"><img src="http://www.plagiarismtoday.com/wp-content/uploads/2008/10/cc.png" alt="Creative Commons License" border="0" width="16" height="16" align="absmiddle" /></a> <a href="http://www.photodropper.com/photos/" target="_blank">photo</a> credit: <a href="http://www.flickr.com/photos/98274023@N00/2417001179/" title="billaday" target="_blank">billaday</a></small></td>
</tr>
</table>
<p>In order to keep things interesting here, I&#8217;ve decided to start an irregular series dealing with your content, copyright law and all things related. This series, entitled &#8220;Pop Quiz,&#8221; will ask seven questions in a particular area.</p>
<p>For those interested in playing, the goal is simple: To be the first to get all the questions right. </p>
<p>If you want to play, simply leave a comment below with your answers to the questions. I&#8217;ll confirm the correct answers as soon as practical. The prize for winning, other than a feeling of lukewarm accomplishment, is a link to your home page in the follow-up post, if desired.</p>
<p>If you&#8217;re ready to give it a try, this week&#8217;s questions deal specifically with your content in Google and how to control where your content appears in Google.</p>
<p><span id="more-1841"></span>
<ol>
<li>What are three ways you can request your site be removed from Google search? (Note: This is your own site or a page on it, not an infringer)</li>
<li>What is the email address for the Google DMCA agent? (please only include the part before the @).</li>
<li>What, according to Matt Cutts, is the most effective way to report spam to Google?</li>
<li>What is the name of the Google service that will email you search results as they are picked up by Google?</li>
<li>What is the name of the Google spider, or rather, the name you need to refer to it as when you are trying to block it?</li>
<li>What is the meta tag command to prevent Google from displaying a sample of your content in their results pages?</li>
<li>What search command lets you see approximately how many pages Google has indexed on a site?</li>
</ol>
<p>Bear in mind that some of these are designed to be brain-dead simple and others are meant to be a bit more challenging. However, all of them can be easily researched and looked up.</p>
<p>And yes, it is legal to use Google.</p>
<p>I hope you have fun with this and I look forward to your answers!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.plagiarismtoday.com/2008/10/01/pop-quiz-1-know-your-google/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Finding the Age of a Page</title>
		<link>http://www.plagiarismtoday.com/2008/06/06/finding-the-age-of-a-page/</link>
		<comments>http://www.plagiarismtoday.com/2008/06/06/finding-the-age-of-a-page/#comments</comments>
		<pubDate>Fri, 06 Jun 2008 15:52:16 +0000</pubDate>
		<dc:creator>Jonathan Bailey</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[Products]]></category>
		<category><![CDATA[Content-Theft]]></category>
		<category><![CDATA[Copyright-Infringement]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[google blog search]]></category>
		<category><![CDATA[Plagiarism]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[search spam]]></category>
		<category><![CDATA[Search-Engines]]></category>
		<category><![CDATA[seo]]></category>
		<category><![CDATA[Spam-Blogs]]></category>
		<category><![CDATA[Splogs]]></category>

		<guid isPermaLink="false">http://www.plagiarismtoday.com/?p=1254</guid>
		<description><![CDATA[If you need a quick and easy way to get an idea of when a post went live, there is a Firefox plugin that uses Google to put that information just a click away.]]></description>
			<content:encoded><![CDATA[<p><IMG SRC="http://www.plagiarismtoday.com/images/linkdiagnosis-logo-20080606-104242.png" alt="Link Diagnosis Logo" align="left" class="picleft">One of the more difficult challenges on the Web is determining when a page was created. We simply cannot trust the date and time stamps provided with the content we read, as both good guys and bad guys alike <a href="http://www.plagiarismtoday.com/2008/05/27/spam-bloggers-who-backdate/" title="Spam Bloggers who Backdate">change the date of their posts as necessary</a>.</p>
<p>Search engines, however, can provide a much better set of statistics than a site&#8217;s own timestamps. The only issue is that gleaning the needed information can be difficult. Fortunately, a relatively new Firefox plugin entitled <a href="http://www.linkdiagnosis.com" title="Link Diagnosis">Link Diagnosis</a> helps with that by taking the dirty work out of determining when a page was indexed by Google.</p>
<p>The tool, while not perfect, can be a valuable asset when trying to determine approximately when a page appeared on the Web.<br />
<span id="more-1254"></span></p>
<h4>How it Works</h4>
<p><IMG SRC="http://www.plagiarismtoday.com/images/get-page-age-20080606-104402.png" alt="Get Page Age Screenshot" align="right" class="picright">Link Diagnosis is actually a robust plugin designed to analyze incoming links to a URL for SEO purposes. However, as one of its &#8220;hidden features&#8221; it is able to determine, approximately, <a href="http://blog.linkdiagnosis.com/?p=19" title="http://blog.linkdiagnosis.com/?p=19">the day the URL appeared in Google</a>.</p>
<p>To use it, the user right-clicks the page they want to check and selects the &#8220;Get Page Age&#8221; option. After a few seconds, they are greeted with a JavaScript popup containing the date the script detected the site appeared.</p>
<p>Behind the scenes, it uses <a href="http://www.googletutor.com/2006/08/22/more-google-hacking-using-the-inurl-operator/" title="Google INURL">Google&#8217;s INURL command</a> which, when used in conjunction with a date filter, causes Google to display a date by each resulting URL. What the plugin does is take the URL you wish to check, create the search query and then automatically extract the applicable date, thus turning a multi-step process into a one-click solution.</p>
<p>For anyone seeking to find out the date of a site, this could prove to be both a powerful tool and a good time saver as well.</p>
<h4>Why to Use It</h4>
<p>There are many reasons why you might want to check out the age of a particular page. </p>
<p>For one, you can use it to check if a spam blog or a plagiarist was indexed by Google before or after your original post (provided it was indexed at all). This can help determine what action you should take against the site. </p>
<p>However, many will also find its non-repudiation services to be very useful. If there ever is a dispute about who posted an article or an image first, this tool can help resolve it by providing an independent view on which went up first.</p>
<p>Though certainly not as accurate as <a href="http://www.numly.com">Numly</a> or <a href="http://www.myfreecopyright.com">MyFreeCopyright</a>, using Google is far more accurate than looking at the <a href="http://www.archive.org">Web Archive</a>, especially considering that the latter can take over six months to display any information about a URL.</p>
<p>Still, Link Diagnosis is far from perfect in this area. There are many issues one will have if one tries to rely upon it for non-repudiation.</p>
<h4>Limitations</h4>
<p><IMG SRC="http://www.plagiarismtoday.com/images/page-age-capture-20080606-104544.png" alt="Get Page Age Error" align="left" class="picleft">Before you begin to make heavy use of this service, bear in mind the following caveats:</p>
<p><OL><LI><strong>Google&#8217;s Limitations:</strong> The biggest issue with using the INURL method is that Google does not always index a site or a page immediately after it goes up. There are often delays. Also, the service can only work with pages already in the Google database; anything that has been blacklisted, either by the creator or by Google, will return no results.</LI><br />
<LI><strong>URLs and Not Content:</strong> The function will tell you when the URL appeared in Google, not the content on the page. For permalinks that may be acceptable but dynamic pages, such as the front page of Plagiarism Today, it can create a problem.</LI><br />
<LI><strong>Different Owners:</strong> Also, the system detects when a URL was first indexed by Google, not who owned it at the time. If a site changes ownership, even if it is taken out of Google during the transition, the date shown for the home page will belong to the original owner. </LI></OL></p>
<p>In short, the tool is subject to the exact same gaming and manipulation that Google and the other search engines are. As such, it can provide some quick and dirty information, especially on permalinks, but should never be taken as the ultimate gospel on the age of a page.</p>
<p>Link Diagnosis is no substitute for a true non-repudiation service and it does not claim to be.</p>
<h4>Conclusions</h4>
<p>Personally, I find the other features of Link Diagnosis much more compelling than its &#8220;page age&#8221; feature. Though it is great for a quick analysis, especially of a spam blog permalink, it may not always tell the complete truth or have the information you are seeking.</p>
<p>It is a great analysis tool but it should not be assumed to be the plain truth. There are plenty of ways that it could be wrong.</p>
<p>So, as with every tool, be sure to use it in conjunction with common sense and logic. Have it available, use it if needed, but don&#8217;t use it as a replacement for your own judgment.</p>
<p>No tool is that powerful.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.plagiarismtoday.com/2008/06/06/finding-the-age-of-a-page/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Why Most Spam Blogs are American</title>
		<link>http://www.plagiarismtoday.com/2008/05/13/why-most-spam-blogs-are-american/</link>
		<comments>http://www.plagiarismtoday.com/2008/05/13/why-most-spam-blogs-are-american/#comments</comments>
		<pubDate>Tue, 13 May 2008 16:12:36 +0000</pubDate>
		<dc:creator>Jonathan Bailey</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[DMCA]]></category>
		<category><![CDATA[Legal Issues]]></category>
		<category><![CDATA[Blogspot]]></category>
		<category><![CDATA[Content-Theft]]></category>
		<category><![CDATA[Copyright-Infringement]]></category>
		<category><![CDATA[EU]]></category>
		<category><![CDATA[google blog search]]></category>
		<category><![CDATA[Plagiarism]]></category>
		<category><![CDATA[Spam]]></category>
		<category><![CDATA[Spam-Blogs]]></category>
		<category><![CDATA[Splogging]]></category>
		<category><![CDATA[Splogs]]></category>
		<category><![CDATA[United-States]]></category>

		<guid isPermaLink="false">http://www.plagiarismtoday.com/?p=1042</guid>
		<description><![CDATA[With the Internet becoming more international in every regard and laws in the U.S. turning against spammers, it seems odd that so many spammers are still concentrated within the United States. However, the reasons are simple to understand. ]]></description>
			<content:encoded><![CDATA[<p><IMG SRC="http://www.plagiarismtoday.com/wp-content/uploads/2008/05/blogger-logo2.jpg" alt="Blogger Logo" align="left" class="picleft">On the surface, the United States seems to be a hostile location for spam blogs. The DMCA makes it trivial for copyright holders to get their content removed from sites hosted in the U.S., anti-spam laws make things uncomfortable for email spammers, who often go hand in hand with Web spammers, and most hosts are very cooperative in removing junk from their servers.</p>
<p>Yet, out of the last 40 spam blogs I&#8217;ve looked at, 32 of them were located within the United States. The few that remained were in countries such as Iran, Russia and Ukraine, where copyright law makes it hard to stop them.</p>
<p>So why are so many spam blogs American despite the obstacles to setting up shop in the United States? The answers are surprisingly simple but do not bode well for the future of fighting spam here or abroad.<br />
<span id="more-1042"></span></p>
<h4>An American Tradition</h4>
<p>The beautiful thing about the Internet is that it is possible to set up shop just about anywhere in the world and, in turn, have anyone else in the world come visit you. Geographic borders are meaningless, until you look at legal issues.</p>
<p>Legally speaking, though, the U.S., much like the EU and similar legal climates, seems to be a very hostile place for spammers. Not only does copyright law give easy recourse to those who are scraped, but hosts are also well versed in dealing with spam and often take action without prompting.</p>
<p>It seems logical that spammers would start to move their operations overseas, into countries with weaker laws and enforcement. A stable home would mean longer-running spam operations, which would mean more trust with the search engines and, theoretically, more money.</p>
<p>But despite this, spam seems to be an almost purely American tradition. American hosts are rife with spam blogs and there seems to be no rush on the part of spammers to move to other nations. Despite greener pastures, the purveyors of junk are fine working within the United States and, when one looks at the reasons, it is clear why that isn&#8217;t going to change any time soon.</p>
<h4>Sticking Around</h4>
<p>Spammers aren&#8217;t staying within the U.S. to make it easier for American bloggers to shut them down; rather, they have their own interests in mind. Consider the following: </p>
<p><OL><LI><strong>Cost:</strong> Hosting is extremely cheap in the United States and, with the dollar falling, it is only getting cheaper. For six dollars per month you can get a <a href="http://www.dreamhost.com/hosting.html" title="Dreamhost Hosting">hosting account that lets you host unlimited domains</a>. It is cheaper to risk having accounts cut than to pay a premium for hosting elsewhere.</LI></p>
<p><LI><strong>Free Hosts:</strong> The vast majority of free hosts are located within the United States. Spammers that want to target sites such as Blogspot are pretty much forced to stay within the country.</LI></p>
<p><LI><strong>Still Vulnerable to DMCA:</strong> Even if a spam blog sets up shop in another country, they would still be vulnerable to American laws with regards to the search engines. Since Google, Yahoo!, MSN and Ask are all American companies, a copyright holder can still use the DMCA to effectively blacklist them from search.</LI></p>
<p><LI><strong>Search Engine Trust:</strong> Though I have not been able to find evidence of this, it seems only logical that search engines would put more trust into sites closer to their searchers. A site hosted in Iran may do well in Iranian searches, but would not likely perform well in the bulk of search results. </LI></p>
<p><LI><strong>Cooperative Hosts:</strong> Despite the laws in the U.S. that prohibit this, there is still no shortage of lesser-known hosts that will gladly turn a blind eye to spam and copyright infringement. These hosts can typically get away with it because U.S. law <a href="http://www.plagiarismtoday.com/2008/01/11/why-your-copyright-is-second-rate/" title="Why Your Copyright is Second Rate">makes it so difficult to sue for copyright infringement</a>. </LI></OL></p>
<p>The end result is that the life of a U.S.-based spammer is far from easy, but it certainly is fruitful and, with a little bit of creativity and effort, the obstacles can be easily overcome.</p>
<h4>Mitigating the Problem</h4>
<p>With the current legal and hosting climate in the United States, there is very little Webmasters can do to outright stop this problem. However, there are steps that we can all take to minimize this issue.</p>
<p><OL><LI><strong>Use The Laws:</strong> The fact that most spam is within the United States is something of a gift. Most hosts respect the DMCA and will remove infringing works. Taking advantage of that is a powerful start. The DMCA has been around in the U.S. for ten years and most hosts have effective policies for dealing with such notices.</LI></p>
<p><LI><strong>Target Search Engines:</strong> Hosts that are uncooperative need to be pointed out to the search engines in the form of <a href="http://www.plagiarismtoday.com/2008/03/21/the-best-way-to-report-spam-to-google/" title="Best Way to Report Google Spam">spam reports</a>. The reason is that, if enough spam is reported in an IP range, Google will start to distrust the entire host and that will affect the host&#8217;s bottom line, driving away both spammers and legitimate customers.</LI></p>
<p><LI><strong>Use Search Engine DMCAs:</strong> Though <a href="http://www.plagiarismtoday.com/2006/06/02/google-the-dmca-and-you/" title="Google and the DMCA">working with Google may be tricky</a>, once a spam report has been filed, sending a DMCA notice to the search engines can decapitate the spam attack, making it useless. </LI></OL></p>
<p>These are not perfect solutions, but they are ways that everyday Webmasters can hit back at the spam problem without having to go through the hassle of finding an attorney.</p>
<h4>Conclusions</h4>
<p>The U.S. is in no danger of being dethroned as the Web spam king. Though spammers will diversify as the Web becomes even more international, like wolves hiding in a flock, they will follow the rest of us in a bid to fit in.</p>
<p>But as frustrating as this is, it does work to our advantage and gives us some powerful tools to fight back. All it takes is the know-how and the willingness to take action.</p>
<p>Fortunately, it seems that more and more are fed up with Web spam and are doing something about it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.plagiarismtoday.com/2008/05/13/why-most-spam-blogs-are-american/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Workfriendly: Yet Another Issue</title>
		<link>http://www.plagiarismtoday.com/2008/04/08/workfriendly-yet-another-issue/</link>
		<comments>http://www.plagiarismtoday.com/2008/04/08/workfriendly-yet-another-issue/#comments</comments>
		<pubDate>Tue, 08 Apr 2008 14:56:34 +0000</pubDate>
		<dc:creator>Jonathan Bailey</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[Personal Experiences]]></category>
		<category><![CDATA[Prevention]]></category>
		<category><![CDATA[Content-Theft]]></category>
		<category><![CDATA[Copyright-Infringement]]></category>
		<category><![CDATA[errors]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[google blog search]]></category>
		<category><![CDATA[Plagiarism]]></category>
		<category><![CDATA[Scraping]]></category>
		<category><![CDATA[technorati]]></category>
		<category><![CDATA[workfriendly]]></category>

		<guid isPermaLink="false">http://www.plagiarismtoday.com/?p=887</guid>
		<description><![CDATA[Workfriendly, a script that masks the Web to look like an open Microsoft Word document, may have been created as a joke, but it continues to create serious problems for the Webmasters that it scrapes. ]]></description>
			<content:encoded><![CDATA[<p><img class="picleft" style="float: left;" src="http://www.plagiarismtoday.com/wp-content/uploads/2008/04/workfriendlylogo.png" alt="WorkFriendly Logo" width="185" height="36" />Back in November of last year, I wrote an article about <a title="WorkFriendly" rel="nofollow" href="http://www.workfriendly.net">Workfriendly</a>, calling it an &#8220;<a title="WorkFriendly as an Accidental Scraper" href="http://www.plagiarismtoday.com/2007/11/09/workfriendly/">accidental scraper</a>&#8221; and accusing the site of allowing search engines to index pages containing scraped content.</p>
<p>The site, which is simply a script that <a href="http://www.diylife.com/2008/03/17/surf-the-web-without-your-boss-knowing/">modifies other sites</a> to look like a document in Microsoft Word, so that one can surf the Web at work without raising suspicion, has <a title="Google Results for WorkFriendly" href="http://www.google.com/search?q=site%3Aworkfriendly.net&amp;ie=utf-8&amp;oe=utf-8&amp;aq=t&amp;rls=org.mozilla:en-US:official&amp;client=firefox-a">nearly a quarter of a million URLs referenced in Google</a>, even though only one page, the home page, contains original content.</p>
<p>However, I recently discovered that Workfriendly has another issue, one that causes, in some cases, both users and the search engines to seek out nonexistent URLs, causing 404 errors in very large numbers.</p>
<p>Though it is a problem caused by Workfriendly, it is one that Webmasters and bloggers need to take action to correct if they are vulnerable. Otherwise, the search engines could be steered toward hundreds of non-working URLs on your site, potentially hurting your ranking in them.<br />
<span id="more-887"></span></p>
<h4>Discovering the Problem</h4>
<p><img class="picright" src="http://www.plagiarismtoday.com/wp-content/uploads/2008/04/workfriendlysucks21.jpg" border="0" alt="workfriendlysucks2.jpg" width="267" height="275" align="right" />I discovered the problem with Workfriendly over the weekend by accident. I logged into my Google Webmaster Tools account to check on any errors I had and was stunned to find over 150 file not found errors.</p>
<p>WordPress typically does a pretty good job avoiding file not found errors so to discover so many on my site, especially with no other errors found, was surprising.</p>
<p>Thinking that, perhaps, my recent update had caused an issue with my permalinks, I looked at the errors themselves. One was caused by me changing the date on a post, another was a server error where the URL worked fine, but the other 149 pointed to a directory that does not and has never existed on this server: &#8220;/browse/Office2003Blue/&#8221;.</p>
<p><img src="http://www.plagiarismtoday.com/wp-content/uploads/2008/04/workfriendlysucks3-2.png" border="0" alt="workfriendlysucks3_2.png" width="550" height="202" /></p>
<p>I remembered that Workfriendly used a similar link structure when you browsed the Web through it. I hopped onto the site and pulled up Plagiarism Today and watched as Workfriendly pulled up the site successfully. Clearly, the ban I had put in place a few months ago had stopped working, likely due to the plugin I was using not being compatible with newer versions of WordPress.</p>
<p><img class="picleft" src="http://www.plagiarismtoday.com/wp-content/uploads/2008/04/workfriendlysuck7.jpg" border="0" alt="workfriendlysuck7.jpg" width="368" height="205" align="left" />After pulling up Plagiarism Today in Workfriendly, I hovered my mouse over one of the links and looked at the URL. Indeed, it was pointing to URLs on this server in the nonexistent &#8220;browse&#8221; directory. Clicking the link resulted in chaos in Workfriendly and, in most cases, led to the site loading up without Workfriendly&#8217;s obfuscation.</p>
<p>I immediately set out to block Workfriendly, this time using a hand-coded <a title="How to Block Scrapers with .htaccess" href="http://www.plagiarismtoday.com/2007/07/02/using-htaccess-to-stop-content-theft/">.htaccess block</a>, but not before trying to figure out what was causing the problem.</p>
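<p>A minimal sketch of such a block, assuming Apache with mod_rewrite enabled and assuming the scraper fetches pages from a single known IP address (192.0.2.10 below is a documentation placeholder, not Workfriendly&#8217;s real address):</p>
<pre><code># Refuse requests from the scraper's server (placeholder IP, an assumption)
RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.10$
RewriteRule ^ - [F]</code></pre>
<p>The server&#8217;s access logs are the usual place to find the address that actually needs blocking.</p>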
<h4>Understanding the Issue</h4>
<p>What made the problem perplexing was that it seemed to only be this site that was having the issue. Other sites I tested with Workfriendly worked fine.</p>
<p>However, after I looked at the source code for the page that Workfriendly created, the problem became almost immediately clear.</p>
<p>Plagiarism Today uses a &#8220;base&#8221; tag. It is a tag placed in a page&#8217;s header that tells search engines and Web browsers what the &#8220;base&#8221; URL of your site is so that, when you use relative links (links that do not begin with an &#8220;http://&#8221;), the browser knows what URL you are pointing to.</p>
<p>It is a good practice for SEO reasons and to help with <a title="Preventing 302 Hijacking" href="http://www.plagiarismtoday.com/2007/06/14/302-hijacking-an-old-danger-made-new-again/">preventing 302 hijacking</a>. Still, most sites do not have one and, in many cases, it isn&#8217;t necessary.</p>
<p>The problem was that Workfriendly, despite having manipulated all of the links on my site, was using relative links for everything. Rather than saying &#8220;http://www.workfriendly.net/browse/&#8230;&#8221; the links simply said &#8220;/browse/&#8230;&#8221;.</p>
<p>When it was combined with the base tag by the browser, that converted all of the links to &#8220;http://www.plagiarismtoday.com/browse/&#8230;&#8221;, a link that does not exist.</p>
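<p>As a rough illustration of that resolution (a sketch only; the markup and the shortened path are placeholders, not Workfriendly&#8217;s actual output), a rewritten page that keeps the original base tag might look like this:</p>
<pre><code>&lt;head&gt;
  &lt;!-- Base tag carried over from the original site --&gt;
  &lt;base href="http://www.plagiarismtoday.com/" /&gt;
&lt;/head&gt;
&lt;body&gt;
  &lt;!-- Relative link written by the proxy script (hypothetical path) --&gt;
  &lt;a href="/browse/Office2003Blue/..."&gt;Read more&lt;/a&gt;
  &lt;!-- The browser resolves the link against the base tag:
       http://www.plagiarismtoday.com/browse/Office2003Blue/... --&gt;
&lt;/body&gt;</code></pre>
<p>On a page with no base tag, the same link resolves against the address the page was loaded from, which is why other sites appeared to work.</p>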
<p>The combination of the base tag and Workfriendly&#8217;s use of relative links was causing the site to throw back URLs that did not exist and, due to the poor use of robots.txt, causing the search engines to pick up those bad links as well.</p>
<h4>An Inconsiderate Script</h4>
<p>My issue with Workfriendly has never been the service itself. Though some could argue that it creates a derivative work of the sites it processes, since the works are never saved, but are rather created dynamically, it is a difficult case to make.</p>
<p>However, more to the point, I am not upset about sites that want to remix or alter the site to make it easier to read. I would not oppose a version better suited for the visually impaired, for mobile browsers or other formats as needed, so long as the site showed basic respect for the content it was displaying.</p>
<p>And that is the problem with Workfriendly. The service shows no consideration for the Webmasters whose content it uses.</p>
<p>For one, the site allows the search engines to index the scraped pages, even though the pages do not exist and are, instead, dynamically-generated.</p>
<p>Second, sloppy programming on the site causes it to generate artificial 404 errors that could hurt Webmasters when dealing with the search engines. Fortunately though, since the bad links are on an external site, they likely won&#8217;t have much impact.</p>
<p>However, if Workfriendly had simply used a correct link format, including the &#8220;http://www.workfriendly.net&#8221; before each link or stripping out the Base tag, the issue would not be a problem at all.</p>
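<p>For illustration only, here is a hedged sketch of what more considerate output from such a service might contain. The first line is a fully qualified link, which resolves to the same address no matter what base tag the original page carries. The second line is the standard robots meta tag, which is one possible way (my suggestion, not something Workfriendly offers) to ask search engines not to index a dynamically generated copy. The link text is hypothetical.</p>
<blockquote><p>&lt;a href=&quot;http://www.workfriendly.net/browse/&#8230;&quot;&gt;A rewritten link&lt;/a&gt;<br />
&lt;meta name=&quot;robots&quot; content=&quot;noindex&quot; /&gt;</p></blockquote>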
<p>But what is perhaps strangest of all is that Workfriendly offers you a script that you can put on your site to direct your visitors to their version of your site. However, in addition to letting your visitors use the Workfriendly service, you may be helping the search engines find your content in their links.</p>
<p>It seems unlikely that it is worth the trade-off.</p>
<h4>Conclusions</h4>
<p><img src="http://www.plagiarismtoday.com/wp-content/uploads/2008/04/workfriendlysucks5-1.jpg" border="0" alt="workfriendlysucks5-1.jpg" width="250" height="163" align="right" />Personally, I decided it was time to be done with Workfriendly. I edited my .htaccess file and have banned the server from accessing this site. So far it is the only IP to be completely banned from this domain. If you attempt to access the site from Workfriendly, you will get the message displayed to the right.</p>
<p>If anyone is looking for the code I added to my .htaccess file, I simply put this before any of my WordPress code:</p>
<blockquote><p>order allow,deny<br />
deny from 66.226.27.21<br />
allow from all</p></blockquote>
<p>This certainly isn&#8217;t the type of step I wanted to take, but it is what I felt I was forced to do and, sadly, what I have to encourage others to look at doing too.</p>
<p>But the problem is that, in their bid to create something simple and fun, the creators of Workfriendly made something that poses a real danger to Webmasters and bloggers. Though simple changes to the system could remedy these problems easily, the authors have either neglected or refused to do so.</p>
<p>The result, on this site at least, is that Workfriendly is banned. I have attempted to contact the creators several times in the past but have never received a response. Considering all of the attention that has been paid to the scraping issue, it seems that the creators are either ignoring the criticism or have abandoned the project.</p>
<p>Either way, right now Workfriendly is just another problem for Webmasters and bloggers to worry about.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.plagiarismtoday.com/2008/04/08/workfriendly-yet-another-issue/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>The Best Way to Report Spam to Google</title>
		<link>http://www.plagiarismtoday.com/2008/03/21/the-best-way-to-report-spam-to-google/</link>
		<comments>http://www.plagiarismtoday.com/2008/03/21/the-best-way-to-report-spam-to-google/#comments</comments>
		<pubDate>Fri, 21 Mar 2008 14:51:44 +0000</pubDate>
		<dc:creator>Jonathan Bailey</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[Videos]]></category>
		<category><![CDATA[Content-Theft]]></category>
		<category><![CDATA[Copyright-Infringement]]></category>
		<category><![CDATA[google blog search]]></category>
		<category><![CDATA[google-video]]></category>
		<category><![CDATA[Plagiarism]]></category>
		<category><![CDATA[search spam]]></category>
		<category><![CDATA[Search-Engines]]></category>
		<category><![CDATA[Spam-Blogs]]></category>
		<category><![CDATA[Splogging]]></category>
		<category><![CDATA[Splogs]]></category>

		<guid isPermaLink="false">http://www.plagiarismtoday.com/2008/03/21/the-best-way-to-report-spam-to-google/</guid>
		<description><![CDATA[Many complain that it is very difficult to get Google to take action on reported spam blogs. However, a simple trick may make it easier to get the search engine's attention when reporting junk content. ]]></description>
			<content:encoded><![CDATA[<p><img SRC="http://aycu34.webshots.com/image/48033/2000709108004570234_rs.jpg" alt="Google Webmaster Tools Image" align="left" class="picleft"/>I was going through videos of past WordCamp presentations to <a href="http://dallas.wordcamp.org/schedule/">prepare for my own next week</a> and found myself <a href="http://onemansblog.com/2007/08/04/matt-cutts-lecture-whitehat-seo-tips-for-bloggers/">re-watching a presentation</a> by Google&#8217;s <a href="http://www.mattcutts.com/blog/">Matt Cutts</a> that he gave at WordCamp San Francisco in 2007.</p>
<p>At the forty minute mark in the presentation, Cutts said something that was interesting to those of us who deal with spam blogs but has been largely overlooked. When discussing <a href="http://www.google.com/webmasters/">Google&#8217;s Webmaster Center</a>, he mentioned that you can report spam through their Webmaster Tools feature and that they &#8220;give more weight&#8221; to those reports than the ones made through <a href="http://www.google.com/contact/spamreport.html">their public form</a>.</p>
<p>In short, if you have access to Google&#8217;s Webmaster Tools, which is free and easy to register for, you can use the form in there to file a more meaningful spam report. Best of all, the form is identical to the public one and should not seem foreign to anyone used to filing spam reports.</p>
<p><img SRC="http://aycu30.webshots.com/image/47549/2006358667072020216_rs.jpg" alt="How to report Google Spam" align="right" class="picright"/>This is presumably because the spam form in the Webmaster Tools is not anonymous, unlike the public one. Google, understandably, gives more significance to reports where they know the party providing the information.<br />
<span id="more-855"></span><br />
To file the report, simply log into the Webmaster Tools dashboard and click the &#8220;Report spam in our index&#8221; link on the right-hand side. You can also report paid links.</p>
<p>This may resolve many of the <a href="http://www.webmasterworld.com/forum30/32931.htm">claims that</a> Google <a href="http://www.quickonlinetips.com/archives/2006/07/how-to-complain-and-report-spam-blogger-blogs/">does not respond</a> (see comments) to spam reports.</p>
<p>All in all, while this is a very simple trick, it might help with the reporting of spam in cases where a DMCA notice is simply not practical.</p>
<p><strong>Note:</strong> In a strange coincidence, I found the video on <a href="http://onemansblog.com/">John Pozadzides&#8217; blog</a>; he will be speaking directly before me at WordCamp Dallas. </p>
<p><object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" width="437" height="370" id="viddler"><param name="movie" value="http://www.viddler.com/player/34fc548d/" /><param name="allowScriptAccess" value="always" /><param name="allowFullScreen" value="true" /><embed src="http://www.viddler.com/player/34fc548d/" width="437" height="370" type="application/x-shockwave-flash" allowScriptAccess="always" allowFullScreen="true" name="viddler" ></embed></object></p>
]]></content:encoded>
			<wfw:commentRss>http://www.plagiarismtoday.com/2008/03/21/the-best-way-to-report-spam-to-google/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
		<item>
		<title>Spam Blogs and AdSense Dollars</title>
		<link>http://www.plagiarismtoday.com/2008/03/13/spam-blogs-and-adsense-dollars/</link>
		<comments>http://www.plagiarismtoday.com/2008/03/13/spam-blogs-and-adsense-dollars/#comments</comments>
		<pubDate>Thu, 13 Mar 2008 16:51:46 +0000</pubDate>
		<dc:creator>Jonathan Bailey</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[News]]></category>
		<category><![CDATA[Prevention]]></category>
		<category><![CDATA[Adsense]]></category>
		<category><![CDATA[Content-Theft]]></category>
		<category><![CDATA[Copyright-Infringement]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[google blog search]]></category>
		<category><![CDATA[Plagiarism]]></category>
		<category><![CDATA[Scraping]]></category>
		<category><![CDATA[Spam-Blogs]]></category>
		<category><![CDATA[Splogging]]></category>
		<category><![CDATA[Splogs]]></category>

		<guid isPermaLink="false">http://www.plagiarismtoday.com/2008/03/13/spam-blogs-and-adsense-dollars/</guid>
		<description><![CDATA[Adsense keywords can fluctuate wildly in price. So what happens when a keyword related to your site becomes the subject of a bidding war? One Austin attorney found out.]]></description>
			<content:encoded><![CDATA[<p><img SRC="http://aycu37.webshots.com/image/47516/2004485230437834245_rs.jpg" align="left" class="picleft"/><a href="http://www.austindefense.com/">Jamie Spencer</a> is an attorney from Austin, TX who recently discovered that one of her blogs was <a href="http://blog.austindefense.com/2008/03/articles/other-blogs/splogs-plagiarizing-for-money/">being scraped heavily by spammers</a>. </p>
<p>However, unlike most victims of scraping, she knew exactly why her blog was being targeted so heavily. She had recently read an article about the <a href="http://www.cwire.org/highest-paying-search-terms">highest-paying AdSense keywords</a> and knew that &#8220;Austin DWI&#8221; was commanding, at that time, well over $80.00 per click.</p>
<p>The results were predictable. As the term rocketed up in value, scrapers and other spam bloggers settled in to try and lure visitors for lucrative clicks and, along the way, ended up grabbing her keyword-rich content.</p>
<p>Her story illustrates a strange problem when dealing with spam blogs: it is impossible to predict exactly where and what they will seize upon. They are, like anything else on the Web, highly prone to trends and changes in the climate, making it virtually impossible to guess what will and will not be targeted tomorrow.<br />
<span id="more-844"></span></p>
<h4>EBay on Steroids</h4>
<p>According to Spencer, the spike in the price of the keyword was caused by a &#8220;bidding war&#8221; between attorneys in the Austin area. Since many lawyers in the region use Google to promote themselves and each wanted to be number one in the bidding, they kept raising the price for the keyword, as in any other online auction, until the cost they paid was far higher than what the term was actually worth.</p>
<p>The keyword, according to CyberWire, has come down in price, to approximately $55, still well above the average keyword cost and still in the top twenty.</p>
<p>However, these kinds of bidding wars can happen anywhere to almost any keyword. All that it takes is for two or three well-financed advertisers to seek out the top spot and keep trumping one another until they drive the price up past what the keyword is worth. </p>
<p>Though these bidding wars usually flare up and flame out quickly, they can last more than long enough for spammers to target the term and start scraping related content. This means that, if you happen to have a blog or site in the area, while your AdSense revenue may go up if you run ads, your content is likely to be scraped and reposted much more than usual, putting your search engine ranking in jeopardy.</p>
<p>The end result is that your site can, almost overnight, go from experiencing only a minimal amount of content theft to being a prime target.</p>
<p>All it takes is a bidding war between a couple of advertisers in your field.</p>
<h4>The Good News</h4>
<p><img SRC="http://aycu09.webshots.com/image/48608/2004412623190039047_rs.jpg" align="right" class="picright"/>Fortunately, while these bidding wars seem to be fairly common, they rarely lead to increased scraping. </p>
<p>The first reason for that is that the bidding rarely reaches heights that would attract the attention of scrapers. Though a leap in price from one dollar to four dollars per click might quadruple your AdSense revenue while the bidding war is going on, it will not attract many new spammers as there are still many keywords that are much more valuable.</p>
<p>The second reason is that spammers do not base their decision on what to target by keyword cost alone. If you look at the list of <a href="http://www.cwire.org/highest-paying-search-terms">most expensive keywords</a>, you see that the vast majority are for attorneys. But while spam about legal issues is present and common, it pales in comparison to gambling, pornography and other traditional spam targets.</p>
<p>The reason for this is that high-paying keywords are worthless if the terms are not regularly searched for and visitors are not likely to click the ads. A spammer can make more money off of fifty $2 clicks than just one $80 click.</p>
<p>The keywords that spammers target will have a balance between price, search frequency and click ratio. A bidding war may motivate spammers to target a borderline term they wouldn&#8217;t have otherwise, but only in the most extreme conditions would it cause them to take heavy interest in a keyword that was, previously, completely ignored.</p>
<h4>Conclusions</h4>
<p>The bottom line for bloggers is three-fold.</p>
<ol>
<li><strong>All Blogs are Scraped:</strong> All blogs, regardless of their content and popularity, are going to be scraped at some point. It is an inevitability. </li>
<li><strong>Some Blogs Will Be Targeted:</strong> Some blogs, such as those in traditional spam fields, will be heavily targeted as their content is more appealing. They may see many times the scraping of blogs in other fields.</li>
<li><strong>Price Fluctuations Can Impact Scraping:</strong> Finally, price fluctuations, both up and down, can impact the amount of scraping your site sees. Though the price is not the primary determiner of whether your blog will be targeted or not, it can cause problems in some circumstances.</li>
</ol>
<p>As a Webmaster, even if you do not advertise, you should probably be at least somewhat aware of the value of your keywords and at least understand if they would pay well or not. Doing so may not help you prevent scraping but it at least gives you an idea of how much you should be on the lookout for it and how much time you should dedicate to it.</p>
<p>After all, knowing your enemy is half the battle and there is no better way to understand how they might respond than by knowing what they want and how they plan to obtain it from you&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.plagiarismtoday.com/2008/03/13/spam-blogs-and-adsense-dollars/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Why Blog Searching Fails</title>
		<link>http://www.plagiarismtoday.com/2008/03/07/why-blog-searching-fails/</link>
		<comments>http://www.plagiarismtoday.com/2008/03/07/why-blog-searching-fails/#comments</comments>
		<pubDate>Fri, 07 Mar 2008 17:51:48 +0000</pubDate>
		<dc:creator>Jonathan Bailey</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[Personal Experiences]]></category>
		<category><![CDATA[Products]]></category>
		<category><![CDATA[Punditry]]></category>
		<category><![CDATA[blog search]]></category>
		<category><![CDATA[Content-Theft]]></category>
		<category><![CDATA[Copyright-Infringement]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[google blog search]]></category>
		<category><![CDATA[icerocket]]></category>
		<category><![CDATA[Plagiarism]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[Search-Engines]]></category>
		<category><![CDATA[searching]]></category>
		<category><![CDATA[technorati]]></category>

		<guid isPermaLink="false">http://www.plagiarismtoday.com/2008/03/07/why-blog-searching-fails/</guid>
		<description><![CDATA[Blog searching has become an important tool both for gathering the information I use on this site, but also for staying on top of content theft issues. Unfortunately, blog searching has deteriorated to the point of uselessness and I break down why. ]]></description>
			<content:encoded><![CDATA[<p><img SRC="http://aycu40.webshots.com/image/45679/2002442769408806476_rs.jpg" align="left" class="picleft"/>Blog searching has been something of the holy grail for the new Web. Countless companies and search engines have worked to tap the endless stream of information that is the blogging world to deliver useful results in near-real time.</p>
<p>Bloggers, in turn, have used these tools to keep on top of the various topics, check for misuse of their content, seek out related posts and look for other sites referencing them.  </p>
<p>Many even use the tools without realizing. WordPress users, for example, use blog search every time they load their dashboard and see their &#8220;incoming links&#8221;. Others use it when they embed <a href="http://technorati.com/help/tags.html">Technorati Tags</a> into their posts. </p>
<p>However, blog searching is not working and it is getting worse. Where it once returned respectable results, today it throws out far more noise than signal. This has made blog search nearly useless for most practical purposes. The only effective use left for the technology is searching for content theft and, even then, only in certain situations.</p>
<p>Unless something is done to fix blog search, it is only a matter of time before the technology is left completely by the wayside.<br />
<span id="more-839"></span></p>
<h4>My History</h4>
<p><img SRC="http://aycu12.webshots.com/image/48171/2002420190824064829_rs.jpg" align="right" class="picright"/>When I first started Plagiarism Today nearly three years ago, I started using blog search extensively. I subscribed to a series of related <a href="http://www.technorati.com">Technorati</a> watchlists and used my RSS reader to keep track of the latest happenings in the field of copyright and content theft. As time went on, I added RSS feeds from other search engines including <a href="http://www.icerocket.com">Icerocket</a> and <a href="http://www.google.com/blogsearch">Google Blog Search</a>. </p>
<p>Initially the system worked pretty well. Though there were some garbage results, for the most part every item in my RSS reader warranted opening in a browser. However, over time, the amount of noise grew at a much faster rate than the number of legitimate posts, eventually outpacing them three or four times over.</p>
<p>Now, I feel as if I am inundated by these watchlists, getting dozens of results per hour, only a fraction of which are actually original, human-written articles.</p>
<p>As a result, I spend as much time per day filtering out the junk from my feeds as I do responding to and bookmarking legitimate articles. </p>
<p>The system has become the model of inefficiency and is borderline useless. However, looking through my feeds, I think I know where much of the noise is coming from and why this problem has crept up on us.</p>
<h4>Issues With Blog Search</h4>
<p>When performing an autopsy on my blog search results recently, I noticed that there were six types of posts that were cluttering up my feeds and suffocating the original content.</p>
<ol>
<li><strong>Spam Blogs:</strong> Splogs are an easy target to blame for blog search clutter but they are not the worst players, truth be told. Since most simply parrot posts already available elsewhere in the feed, they are easy to ignore. Still, the nature of spam blogs pretty much ensures that every post I want to read is repeated two or three more times over the coming hours and days.
</li>
<li><strong>Non-Blogs:</strong> When blog searching started, pretty much only blogs had RSS feeds. Today, RSS feeds are included in forums, Twitter pages, social networking profiles and more. Sadly, all of these things are routinely showing up in blog search results. While often relevant in terms of keywords, they are difficult to comment on and rarely provide any news.</li>
<li><strong>Comment Feeds:</strong> It seems that none of the blog search engines have been good at recognizing the difference between the main feed and the comment feed. I routinely see &#8220;Comment On&#8221; posts in my RSS reader. Typically they are posts I&#8217;ve already posted a comment to or the post is very old and is simply receiving a trickle of comments or trackbacks.</li>
<li><strong>Old Posts:</strong> It is a strange but increasingly common issue where the results that get returned are anything but timely. I regularly get posts marked as &#8220;new&#8221; that are really several weeks old. This morning, for example, one of my blog search feeds had two posts talking about the candidates &#8220;fighting for Texas&#8221; and reporting the <a href="http://www.plagiarismtoday.com/2008/02/20/the-obama-plagiarism-scandal/">allegations of plagiarism against Obama</a> as new. The posts were dated in late February. </li>
<li><strong>Repeated Posts:</strong> I have RSS feeds from three of the major blog search engines so I fully expect to see the same post a few times on the different feeds. However, I regularly see the same post repeated on the same feed, sometimes even days or weeks later. Many times I click on a post, thinking it is new, and am stunned that I commented on it weeks ago and a quick check shows that it showed up on the same feed when it was brand new.</li>
<li><strong>Foreign Language Posts:</strong> Finally, even though my search terms are in English, I regularly get results where the bulk of the post is in another language. Though these blogs are often merely spam blogs using automatic translation, they are still of little use to me as I can&#8217;t read them, at least not without some translation, which none of the blog search engines seem to provide.</li>
</ol>
<p>This is not to say that all of the search engines suffer from the problems above equally. Google Blog Search, for example, is inundated with spam blogs and non-blog results, Technorati is worst about repeating posts and delivering content too late while Icerocket seems to return a large number of foreign language posts. </p>
<p>Of course, these aren&#8217;t the only things that have been limiting the usefulness of my blog search tools, but they are six of the biggest players to be certain. If I were going to start fixing blog search, these are the places I would start.</p>
<h4>Stinky Swiss Cheese</h4>
<p><img SRC="http://aycu03.webshots.com/image/48362/2002435910069554829_rs.jpg" align="left" class="picleft"/>Of course, the problem with blog search isn&#8217;t just that its results are cluttered. I&#8217;d be willing to deal with a large amount of clutter in order to stay on top of the relevant news. Unfortunately, it is a regular occurrence that I miss stories, including those that the feed should have picked up.</p>
<p>I can offer no explanation for why this happens. Many of the sites are available if you perform a manual search, so there should be no reason they are missed. </p>
<p>Fortunately, since I do use multiple search engines, the problem is not as severe as if I had used only one. But even with all of these layers, some sites do not show up and the problem baffles me. </p>
<p>Of the three I use regularly, Icerocket seems to be the worst at picking up all of the articles I need, even regularly missing updates from Plagiarism Today, but the problem, most of the time, seems somewhat random and hard to pin on one or two engines.</p>
<h4>Effects on Content Theft</h4>
<p>When it comes to detecting content theft, many of the most popular tools, including <a href="http://www.plagiarismtoday.com/2007/05/24/copyfeed-plugin-now-available-in-english/">Copyfeed</a> and the <a href="http://www.plagiarismtoday.com/2006/10/05/update-digital-fingerprint-plugin-beta-2/">Digital Fingerprint Plugin</a>, use blog search engines to detect misuse of the RSS feed.</p>
<p>In that regard, the fact that spam blogs and duplicate posts routinely show up in blog search engines is a good thing. It means that the detection is more reliable and that the odds of the bad guys showing up are good.</p>
<p>Unfortunately, the fact that they can reliably appear in the results also encourages spam blogging as an activity. Though blog search engines are not the ultimate target of sploggers, the fact that junk content routinely gets indexed in them does not discourage spammers. This means that, while we will see a large percentage of the scrapers, there will also be many more of them.</p>
<p>Furthermore, with the other issues raised with blog search engines, it appears that we&#8217;ll be stumbling through the clutter far more than we&#8217;ll be battling the bad guys. This could, in the long run, actually hinder our ability to locate and deal with spam bloggers, by not only throwing up excesses of noise, but making it impossible to target the ones that pose the greatest threat.</p>
<h4>Conclusions</h4>
<p>What is striking is how different the blog search results are compared to the traditional search engines. Though Google and Yahoo! may take a few more days to pick up their results, they are almost completely free of spam blogs and seem, overall, to present relevant content in an organized manner.</p>
<p>Where Google Blog Search is a dismal failure, the main Google search engine is a triumph. This shows not just the different challenges each search engine faces, but a lack of focus and effort on creating the best blog search results possible.</p>
<p>It is clear that blog search has not been given the attention it needs to thrive. The companies involved, even those who have blog search as their sole business, have decided that it is not worth dedicating the resources toward fixing these issues.</p>
<p>These are not new problems that only revealed themselves recently. They have been ongoing issues that have simply grown larger and larger to the point that they now drag the entire system down. Where once they were minor nuisances begging to be nipped in the bud, now they are huge problems that feel almost too big to tackle.</p>
<p>Sadly, it is only a matter of time before blog search becomes completely useless, if it isn&#8217;t there already, and it is a problem that many of us have seen coming for a very long time. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.plagiarismtoday.com/2008/03/07/why-blog-searching-fails/feed/</wfw:commentRss>
		<slash:comments>37</slash:comments>
		</item>
	</channel>
</rss>

