<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>splog | Plagiarism Today</title>
	<atom:link href="http://www.plagiarismtoday.com/tag/splog/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.plagiarismtoday.com</link>
	<description>Content Theft, Plagiarism, Copyright Infringement</description>
	<lastBuildDate>Mon, 13 Feb 2012 06:51:37 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Cpedia: A Spam Blog Disguised as an Encyclopedia</title>
		<link>http://www.plagiarismtoday.com/2010/06/09/cpedia-a-spam-blog-disguised-as-an-encyclopedia/</link>
		<comments>http://www.plagiarismtoday.com/2010/06/09/cpedia-a-spam-blog-disguised-as-an-encyclopedia/#comments</comments>
		<pubDate>Wed, 09 Jun 2010 16:21:04 +0000</pubDate>
		<dc:creator>Jonathan Bailey</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[Content-Theft]]></category>
		<category><![CDATA[Copyright]]></category>
		<category><![CDATA[copyright infringement]]></category>
		<category><![CDATA[Copyright-Law]]></category>
		<category><![CDATA[cpedia]]></category>
		<category><![CDATA[cuil]]></category>
		<category><![CDATA[encyclopedia]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Plagiarism]]></category>
		<category><![CDATA[search engine]]></category>
		<category><![CDATA[spam-blog]]></category>
		<category><![CDATA[splog]]></category>
		<category><![CDATA[wikipedia]]></category>

		<guid isPermaLink="false">http://www.plagiarismtoday.com/?p=6818</guid>
		<description><![CDATA[CPedia is a new automatically-generated encyclopedia from the makers of Cuil. So why is it throwing thousands of pages of duplicate content into Google?]]></description>
			<content:encoded><![CDATA[<p><img style=' float: left; padding: 4px; margin: 0 7px 2px 0;'  src="http://www.plagiarismtoday.com/wp-content/uploads/2010/06/cpedia-logo.jpg" alt="" title="cpedia-logo" width="255" height="88" class="alignleft size-full wp-image-6825"></p>
<p>Last week, <a href="http://twitter.com/melebeth">@melebeth</a> introduced me to <a rel="nofollow" href="http://cpedia.com">CPedia</a>, a new &#8220;encyclopedia&#8221; by the makers of <a href="http://cuil.com">Cuil</a>, a search engine that was initially greeted with much fanfare before <a href="http://techcrunch.com/2008/12/27/cuil-fail-traffic-nearly-hits-rock-bottom/">seemingly flaming out</a>. </p>
<p>Cpedia is not an encyclopedia in the strictest sense, as it is not written by human beings. Unlike traditional encyclopedias, which are written by paid experts, or Wikipedia, which is written largely by volunteers, Cpedia is generated automatically from search result pages, producing an automated and <a href="http://gigaom.com/2010/04/16/cpedia-founder-errors/">often wildly inaccurate</a> encyclopedia-like page. </p>
<p>For example, <a rel="nofollow" href="http://www.cpedia.com/wiki/Jonathan_Bailey_of_Plagiarism_Today_(all_pages)">I have a Cpedia page</a>, as well as <a rel="nofollow" href="http://www.cpedia.com/wiki?q=Plagiarism+Today">one for this site</a>, though my personal page doesn&#8217;t actually say anything about me and the one for PT seems to discuss random people and items only tangentially related to the site.</p>
<p>However, the concern Melebeth approached me with was not just about the accuracy of Cpedia, but about the way it used content from other sources. According to her, the search engine was lifting text directly from third-party sites but not properly quoting or citing it.</p>
<p>So, I delved into Cpedia and found, unfortunately, that her fears were largely founded.<span id="more-6818"></span></p>
<h4>How CPedia Works</h4>
<p><img style=' float: right; padding: 4px; margin: 0 0 2px 7px;'  src="http://www.plagiarismtoday.com/wp-content/uploads/2010/06/pt-cpedia-sample-217x300.jpg" alt="" title="pt-cpedia-sample" width="217" height="300" class="alignright size-medium wp-image-6827"></p>
<p>The basic idea behind CPedia is that it combs through the search results for a relevant term and tries to build out an encyclopedia entry automatically. The results, visually, are very similar to Wikipedia but the content is generally more jumbled and difficult to read. </p>
<p>Cpedia does attribute the content it uses, but in a very strange way. If you click or hover your mouse over the text within an article, but not the link, you will be given a sidebar that shows the text that&#8217;s been quoted and a link to the source in the sidebar. If you click the inline text link, you are instead taken to a references page that then links to the original source. </p>
<p>Usually, the individual copied passages are very short, though sometimes the word count of a passage approached 100 words, especially when the source passage was broken up into multiple parts.</p>
<p>CPedia seems to have a nearly unlimited number of topics covered, likely aided by the fact it is automatically generating results, and has many pages that Wikipedia does not, including one for me.</p>
<p>All in all, CPedia is fairly straightforward but that does not mean it isn&#8217;t a problem. In fact, in this case, it means quite the opposite.</p>
<h4>Problems With Cpedia</h4>
<p>Apart from the questionable accuracy of Cpedia, the entire operation, to me, seems highly suspect. The idea of creating new pages of content using snippets from dozens, even hundreds of other pages seems to be a very poor way to do business. </p>
<p>But even setting aside the way the content is created, there are several problems with the attribution alone. Consider the following two:</p>
<ol>
<li><strong>Always One Step From Source Link:</strong> Whether you hover over the text or click to the references page, you are always one action away from the source link. This means users and other search engines alike are always two steps from the source site even though it would be trivial to make it one.</li>
<li><strong>Lack of Clear Quotes:</strong> The entire entry is made up of short verbatim quotes from various sources but it is not clear where the quotes begin and end without hovering over the text. The goal is to make the entire work seem like an original creation, an actual encyclopedia entry, without much in the way of visible quotes, just traditional footnote citations.</li>
</ol>
<p>However, the bigger problem is actually very simple. There are already many sites that build thousands and thousands of entries using snippets from various other pages. They&#8217;re called spam blogs and they use a variety of article generation and spinning technology to build new articles out of hodgepodges of existing ones.</p>
<p>And Cpedia is acting very much like a spam blog. Entries from CPedia are appearing in Google, <a href="http://www.google.com/search?hl=en&#038;safe=off&#038;client=safari&#038;rls=en&#038;q=site%3Acpedia.com&#038;aq=f&#038;aqi=&#038;aql=&#038;oq=&#038;gs_rfai=">which currently has about 177,000 entries indexed</a>, and though <a rel="nofollow" href="http://cpedia.com/robots.txt">Cpedia&#8217;s robots.txt</a> disallows the wiki directory, it doesn&#8217;t seem to be stopping search engines from indexing the entries.</p>
<p>When you factor all of this together, it becomes clear that Cpedia is acting very much like a spam blog and less like an encyclopedia. Was that the intention? Probably not. But it is how the site is functioning, pumping thousands of pages of poorly-written duplicate content into the major search engines.</p>
<p>If that is not the hallmark of a spam blog, I&#8217;m not sure what is.</p>
<h4>Making it Stop</h4>
<p>To be clear, what Cpedia is doing isn&#8217;t, most likely, illegal. Fair use would likely protect their very limited use of the content from each individual source. This is one of the reasons this technique is so common among spam blogs: it makes them almost immune to copyright disputes as a means of closure.</p>
<p>In short, even though the ethics of Cpedia can be hotly debated, most likely they are on the right side of the law.</p>
<p>That being said, if you want your work removed from Cpedia, all you have to do is remove it from Cuil and that can be done by <a rel="nofollow" href="http://www.cpedia.com/info/webmaster_info/">using robots.txt to block &#8220;twiceler&#8221;</a>.</p>
<p>Also, you can block the IP range that Cuil uses for crawling, which is also listed on the link above.</p>
<p>It is a fairly simple change to make. (Note: I have not made it on PT and will not; I keep my robots.txt open intentionally to help observe various issues like these.)</p>
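<p>As a rough sketch of the robots.txt approach described above: the rules below are an assumed example and are not copied from Cuil&#8217;s webmaster page. Only the &#8220;twiceler&#8221; user-agent name comes from this article; the helper name <code>is_allowed</code> and the example.com URLs are hypothetical. The snippet uses Python&#8217;s standard <code>urllib.robotparser</code> to check that the rules block that one crawler while leaving others alone.</p>

```python
from urllib.robotparser import RobotFileParser

# Assumed example rules (not taken from Cuil's documentation):
# block only the "twiceler" crawler, leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: twiceler
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URL, used only for illustration.
    page = "http://example.com/some-post/"
    print("twiceler allowed:", is_allowed(ROBOTS_TXT, "twiceler", page))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", page))
```

<p>Keep in mind that robots.txt is only a request. A crawler that ignores it would have to be stopped by other means, such as the IP block mentioned above.</p>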
<p>All in all, though I disagree strongly with what Cpedia is doing, they do have the right to do it. This makes fighting back trickier, but far from impossible.</p>
<h4>Bottom Line</h4>
<p>What Cpedia is doing, in my opinion, is unethical. They are using quotes from various sites without adequate clarity or attribution. They are pumping thousands of pages of admittedly duplicate content into other search engines and are producing an encyclopedia that, by their own admission, is wildly inaccurate. </p>
<p>Though copyright may not be a viable litigation route, I have to wonder how libel will work in this case as repeating libel is, generally, <a href="http://www.dancingwithlawyers.com/freeinfo/libel-slander-mis-information.shtml">the same as making the libelous statement</a>. In short, those admitted inaccuracies in Cpedia could, in theory, come back to bite the company at a later date.</p>
<p>Search engine liability in cases of libel is still being settled around the world (<a href="http://newsinfo.inquirer.net/breakingnews/infotech/view/20100425-266422/Google-fined-for-pedophile-libel-against-priest">Google won such a claim in the UK</a>), but republishing this information on your own site and admitting it is inaccurate seems to be opening up new avenues for liability.</p>
<p>Would this be a likely claim against Cuil/Cpedia? Probably not. But only because the audience for the site is so small that it seems unlikely many will care. The fact that Cuil/Cpedia has seen so little success is a big part of why webmasters haven&#8217;t noticed the spammy nature of the issue and taken up arms.</p>
<p>To be certain, Cpedia flew under my radar until Melebeth asked me about it. I can imagine it is doing the same for many others right now as well.  </p>
]]></content:encoded>
			<wfw:commentRss>http://www.plagiarismtoday.com/2010/06/09/cpedia-a-spam-blog-disguised-as-an-encyclopedia/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>WordPressDirect Addresses Spam Issue</title>
		<link>http://www.plagiarismtoday.com/2008/12/02/wordpressdirect-addresses-spam-issue/</link>
		<comments>http://www.plagiarismtoday.com/2008/12/02/wordpressdirect-addresses-spam-issue/#comments</comments>
		<pubDate>Tue, 02 Dec 2008 15:00:19 +0000</pubDate>
		<dc:creator>Jonathan Bailey</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[Content-Theft]]></category>
		<category><![CDATA[Copyright-Infringement]]></category>
		<category><![CDATA[Copyright-Law]]></category>
		<category><![CDATA[Plagiarism]]></category>
		<category><![CDATA[Spam-Blogs]]></category>
		<category><![CDATA[Spamming]]></category>
		<category><![CDATA[splog]]></category>
		<category><![CDATA[Wordpress]]></category>
		<category><![CDATA[wordpressdirect]]></category>

		<guid isPermaLink="false">http://www.plagiarismtoday.com/?p=2197</guid>
		<description><![CDATA[WordPressDirect, in a move that it hopes will placate the concerns many have expressed about the service, is removing auto-posting from free members. But is it enough to calm the angry mob?]]></description>
			<content:encoded><![CDATA[<p><img style=' float: left; padding: 4px; margin: 0 7px 2px 0;'  src="http://www.plagiarismtoday.com/wp-content/uploads/2008/12/wordpressdirect-logo-300x52.png" alt="wordpressdirect-logo" title="wordpressdirect-logo" width="300" height="52" class="alignleft size-medium wp-image-2198" />WordPressDirect, the controversial WordPress setup and management service that was <a href="http://mashable.com/2008/11/23/wordpressdirect/">covered on Mashable</a> and <a href="http://www.blogherald.com/2008/11/24/wordpressdirect-blogging-tool-or-spam-engine/">by myself on the Blog Herald</a>, has announced a change in its policy that it hopes will alleviate many of the spam concerns.</p>
<p>The policy change will remove all of the automated content posting features from free user accounts, which make up the &#8220;vast majority&#8221; of WPD members, according to Marty Rozmanith, the creator of WPD.</p>
<p>The tools, however, will remain available for all paid members of the service, regardless of the level they choose. </p>
<p>Previously, unpaid members had limited access to some of the content posting tools, including the Yahoo! Answers, article database and RSS posting tool, enabling free members, who were limited to only three blogs, to automatically post content from a variety of sources, typically without permission.</p>
<p>Whether this does anything to stem the vitriol that has been directed at the service remains to be seen, but I can&#8217;t see how many will be convinced, especially when there are so many difficult questions to be answered.<span id="more-2197"></span></p>
<h4>WordPressDirect Recap</h4>
<p>For those who did not read the previous articles about WPD, the service promotes itself as a &#8220;WordPress deployment and maintenance service that helps people (especially those with very little technical experience) create a search-optimized WordPress blog.&#8221;</p>
<p>In short, it is a one-click install program that not only sets up the software, but also adds a theme, optimizes the permalinks and makes a handful of other SEO-oriented changes. In that regard, it is much like <a href="http://www.netenberg.com/fantastico.php">Fantastico</a>, but with added features to help get the blog started.</p>
<p>However, WordPressDirect stepped into controversy with its add-on tools, which allow users to automatically update the blogs they create using content from a variety of sources including RSS feeds, online article databases and more.</p>
<p>This caused many, especially on the Mashable article, to accuse the site of being a spam service. In that regard, it does share many traits, especially when you look at how the tools work and where they pull their content from.</p>
<p>WordPressDirect attempted to defend itself against the accusations, blaming much of the problem on their marketing, but the attempts to make peace fell on deaf ears for the most part. This, in turn, led to the recent changes they just announced.</p>
<h4>Fixing the Problem?</h4>
<p>Most likely, these changes are going to do little to nothing to placate the mob that has formed around WordPressDirect. Though the changes mean that the 9000+ free members of the site will no longer have the ability to automatically scrape and repost content, it says nothing about the paid members. The limitations on free accounts, including just three blogs per user, effectively meant that no one could actually be a master spammer with a free account (unless they spammed WPD and set up thousands of accounts).</p>
<p>To many, including myself, this sounds like a very shrewd maneuver. Though it takes the ability to do spam-like things away from most of the users, it does not affect the paid ones, and the email contained several pitches for the paid packages. It seems less like an attempt to shed the spam-related reputation than an attempt to profit from it.</p>
<p>This move does not remove these tools from the power users nor does it impact their bottom line in any meaningful way, other than perhaps adding a few new paid members.</p>
<p>WPD, as a service, is walking a very thin line. It is trying to proclaim itself to not be a spam tool while offering many of the exact same features that are found in spam applications. Though, as I said in my Blog Herald article, it would make a very poor spamming program, it is completely foreseeable and almost certain that users, likely even most users, would use it for that purpose.</p>
<p>Furthermore, issues such as the trademark concerns over the use of the WordPress name, the lack of attribution of copied works, etc. remain unaddressed. Though it is a good step, it seems to be one either too small or in an unrelated direction.</p>
<h4>Conclusions</h4>
<p>Shortly after my Blog Herald article was released, WPD sent out an email to all members saying that it was the &#8220;most balanced article&#8221; they could find.</p>
<p>Though I try to be balanced in all of my coverage, I cannot hide the fact that WordPressDirect has me very uneasy and nervous. The service has far too much use for evil and, even though I don&#8217;t know if its creators built the service with such intentions, that is the use that instantly springs to mind for myself and many others.</p>
<p>The problem is that a service such as what WPD proclaims to be, a WordPress installation aid that auto-optimizes the blog, could be very useful. I could even see someone such as myself using it rather than keeping a WordPress checklist for every new blog I install (I routinely get recruited to help with WP installations). </p>
<p>But as useful as that could be, the service is too hot to touch right now and I seriously doubt that is going to change with these recent revisions to their policies. Though I am going to keep an eye on the marketing changes they mentioned, I don&#8217;t see WPD becoming any less of a tainted name anytime soon.</p>
<p>To repair its name, WPD is going to have to make sacrifices that may hurt its business. Sadly, that doesn&#8217;t seem to be what they are doing right now. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.plagiarismtoday.com/2008/12/02/wordpressdirect-addresses-spam-issue/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Copyright 2.0 Show &#8211; Episode 19 &#8211; McTakeDown</title>
		<link>http://www.plagiarismtoday.com/2007/08/13/copyright-20-show-episode-19-mctakedown/</link>
		<comments>http://www.plagiarismtoday.com/2007/08/13/copyright-20-show-episode-19-mctakedown/#comments</comments>
		<pubDate>Mon, 13 Aug 2007 15:25:11 +0000</pubDate>
		<dc:creator>Jonathan Bailey</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[Podcast]]></category>
		<category><![CDATA[Bay-TSP]]></category>
		<category><![CDATA[Content-Theft]]></category>
		<category><![CDATA[Copyright-Infringement]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[linux-veoh-perfect10]]></category>
		<category><![CDATA[Plagiarism]]></category>
		<category><![CDATA[sco]]></category>
		<category><![CDATA[Spam]]></category>
		<category><![CDATA[spam-blog]]></category>
		<category><![CDATA[splog]]></category>
		<category><![CDATA[viacom]]></category>
		<category><![CDATA[YouTube]]></category>

		<guid isPermaLink="false">http://www.plagiarismtoday.com/2007/08/13/copyright-20-show-episode-19-mctakedown/</guid>
		<description><![CDATA[It&#8217;s Monday again and that means it is time for another 40-minute episode of the Copyright 2.0 show. This week the show is filled to the brim with the usual copyright news, humor and sarcasm that has made the show so special. Also included is a special birthday announcement and my pathetic attempt to rewrite...]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s Monday again and that means it is time for another 40-minute episode of the Copyright 2.0 show. This week the show is filled to the brim with the usual copyright news, humor and sarcasm that has made the show so special. Also included is a special birthday announcement and my pathetic attempt to rewrite my own history.</p>
<p>All in all, it was a busy week in copyright news; a total of sixteen stories were covered, including the following:</p>
<ul id="null">
<li>SCO lost much of its Linux copyright infringement suit</li>
<li>Eight more dogpile onto YouTube</li>
<li>Veoh launches a preemptive strike</li>
<li>Perfect 10 is at it again</li>
<li>Google mistakes its own blog as spam, deletes it</li>
<li>And Many more&#8230;</li>
</ul>
<p>You can <a href="http://go.numly.com/1847107081310184589">download the MP3 file here</a>. Those interested in subscribing to the show can do so via <a href="http://www.copyright20.com/podcasts/rss">this feed</a>.</p>
<p><a href="http://del.icio.us/copyright20/19">Show Notes</a></p>
<p>I also want to take a moment to link to <a href="http://arstechnica.com/articles/culture/plagiarism-and-falsified-data-slip-into-the-scientific-literature.ars">this story on Ars Technica</a> dealing with plagiarism issues in the scientific community. It is a great read and I wanted to cover it on the broadcast but there simply wasn&#8217;t any time. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.plagiarismtoday.com/2007/08/13/copyright-20-show-episode-19-mctakedown/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
<enclosure url="http://go.numly.com/1847107081310184589" length="8055849" type="audio/mpeg" />
		</item>
		<item>
		<title>Is Blogger on the Offensive Against Spam?</title>
		<link>http://www.plagiarismtoday.com/2007/06/26/is-blogger-on-the-offensive-against-spam/</link>
		<comments>http://www.plagiarismtoday.com/2007/06/26/is-blogger-on-the-offensive-against-spam/#comments</comments>
		<pubDate>Tue, 26 Jun 2007 15:30:11 +0000</pubDate>
		<dc:creator>Jonathan Bailey</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[News]]></category>
		<category><![CDATA[Blogger]]></category>
		<category><![CDATA[Blogspot]]></category>
		<category><![CDATA[Content-Theft]]></category>
		<category><![CDATA[Copyright-Infringement]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Plagiarism]]></category>
		<category><![CDATA[RSS]]></category>
		<category><![CDATA[Scraping]]></category>
		<category><![CDATA[Spam]]></category>
		<category><![CDATA[spam-blog]]></category>
		<category><![CDATA[splog]]></category>
		<category><![CDATA[Sploggers]]></category>
		<category><![CDATA[Splogs]]></category>

		<guid isPermaLink="false">http://www.plagiarismtoday.com/2007/06/26/is-blogger-on-the-offensive-against-spam/</guid>
		<description><![CDATA[Updated Information Here As part of running this site, I subscribe to many different Technorati Watchlists. They help me keep up to date on the latest in content-theft and plagiarism-related issues. Unfortunately, I see a great deal of spam blogs on these watchlists. What&#8217;s worse, it can be hard to tell, when looking at my...]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.plagiarismtoday.com/2007/06/28/update-google-responds-regarding-blogspot-spam/">Updated Information Here</a></p>
<p>As part of running this site, I subscribe to many different <a href="http://www.technorati.com/watchlist/">Technorati Watchlists</a>. They help me keep up to date on the latest in content-theft and plagiarism-related issues. </p>
<p>Unfortunately, I see a great deal of spam blogs on these watchlists. What&#8217;s worse, it can be hard to tell, when looking at my RSS reader, which blogs are legitimate and which are junk. Thus, I often end up clicking through to the splogs that successfully penetrate Technorati&#8217;s armor.</p>
<p>Most of those spam blogs have, traditionally, been on Blogspot. However, over the past week or so, I&#8217;ve noticed that a lot of the Blogspot links have been returning results indicating that the blog has been locked down for &#8220;Possible Blogger terms of service violations&#8221;.</p>
<p>It appears, at least based upon the sample I have, that Blogger is on a major offensive against spam blogs and that its effectiveness has gone up drastically over the past week or so. If true, this could be great news for bloggers, especially those on Google&#8217;s service, but more research is needed before a victory can be claimed.</p>
<p>I am looking into this matter and am trying to find out exactly what is going on. It could just be that Google has discovered the network responsible for most of the spam targeting my keywords and all of this is a fluke. However, I wanted to pose the question to everyone reading this: Have you noticed a reduction in spam from Blogspot?</p>
<p>I&#8217;ll be interested to hear if others are having similar experiences. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.plagiarismtoday.com/2007/06/26/is-blogger-on-the-offensive-against-spam/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
		</item>
		<item>
		<title>Using Creative Commons to Stop Scraping</title>
		<link>http://www.plagiarismtoday.com/2007/06/05/using-creative-commons-to-stop-scraping/</link>
		<comments>http://www.plagiarismtoday.com/2007/06/05/using-creative-commons-to-stop-scraping/#comments</comments>
		<pubDate>Tue, 05 Jun 2007 17:50:35 +0000</pubDate>
		<dc:creator>Jonathan Bailey</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[Legal Issues]]></category>
		<category><![CDATA[News]]></category>
		<category><![CDATA[Prevention]]></category>
		<category><![CDATA[cc]]></category>
		<category><![CDATA[cc-licenses]]></category>
		<category><![CDATA[Copyright]]></category>
		<category><![CDATA[Copyright-Law]]></category>
		<category><![CDATA[Creative-Commons]]></category>
		<category><![CDATA[RSS]]></category>
		<category><![CDATA[Scraping]]></category>
		<category><![CDATA[spam-blog]]></category>
		<category><![CDATA[splog]]></category>
		<category><![CDATA[Splogging]]></category>

		<guid isPermaLink="false">http://www.plagiarismtoday.com/2007/06/05/using-creative-commons-to-stop-scraping/</guid>
		<description><![CDATA[Many sites, including this one, have expressed concerns that CC licenses may be encouraging or enabling scraping. The problem seems to be straightforward. If a blog licenses all of their content under a CC license, then a scraper that follows the terms of said license is just as protected as a human copying one or...]]></description>
			<content:encoded><![CDATA[<p>Many sites, <a href="http://www.plagiarismtoday.com/2005/12/13/creative-commons-license-to-splog/">including this one</a>, have <a href="http://www.blogmaverick.com/2005/12/10/attack-of-the-splogs-revisited/">expressed concerns</a> that CC licenses may be encouraging or <a href="http://openswitch.org/journal/copyright-and-the-blogger">enabling scraping</a>. </p>
<p>The problem seems to be straightforward. If a blog licenses all of their content under a CC license, then a scraper that follows the terms of said license is just as protected as a human copying one or two works. This may be within the letter of the license, but it violates the spirit of Creative Commons.</p>
<p>However, after talking with <a href="http://creativecommons.org/about/people#21">Mike Linksvayer</a>, the Vice President of Creative Commons, I&#8217;m relieved to say that is not the case. CC licenses have several built-in mechanisms that can prevent such abuse.</p>
<p>In fact, when one looks at the future of RSS, it is quite possible that using a CC license might provide better protection than using no license at all. </p>
<p><span id="more-509"></span><strong>Against the Spirit: A Crisis with the Commons?</strong></p>
<p>Whether some scrapers target CC-licensed material is up for debate. What is clear is that, when they do, it is often a source of frustration. </p>
<p>People choose CC licenses because they want to share their work with others. They want to participate in a cultural revolution and give their ideas new wings. They do not, generally, want to see their entire site mirrored elsewhere, surrounded by Adsense ads and depriving them of traffic.</p>
<p>Ideally, a CC license is supposed to be symbiotic. The licensor gives up certain rights to their work and the licensee, in exchange for use of the work, makes certain the original author gets due credit and is rewarded for his or her effort. Spam bloggers, however, approach the CC license in bad faith, taking as much as they can while giving as little as possible back.</p>
<p>This has prompted many CC license users to either drop or alter their license. It has become common for sites that are being scraped to <a href="http://www.micropersuasion.com/2005/12/blog_content_th.html">change their licenses to &#8220;non-commercial&#8221;</a>, stop using CC licenses or even shut down their sites altogether.</p>
<p>However, though some of these spam blogs do seem to be following the terms and conditions of the Creative Commons Licenses, even if by accident, the vast majority do not. In fact, even enabling commercial use of your work is not an open invitation to be scraped.</p>
<p>As it turns out, CC licenses have built in mechanisms that can be used to fight that kind of abuse.</p>
<p><strong>Where Computers Fear to Tread</strong></p>
<p>For the use of a CC licensed work to be valid, according to Linksvayer, the following terms must be met among others:</p>
<ol>
<li>The work must be attributed and it must provide a link back to the copyright holder.
</li>
<li>If the license is non-commercial, then the work must be used accordingly.
</li>
<li>If a license has a share-alike term attached to it, then the copied work must express the same license.</li>
<li>All CC licensed material must state that it is licensed as such, usually with a statement that says &#8220;This work is licensed under a Creative Commons License&#8221;. Failure to do so puts the reuse in violation. </li>
<li>Finally, with Creative Commons, the licensor has the right to request removal of their name from any reused content, failure to comply puts the reuse in violation of the license.</li>
</ol>
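<p>To make the five terms above concrete, here is a minimal sketch of them as a checklist. It is an illustration only, not a legal test and not anything published by Creative Commons; the <code>Reuse</code> record, its field names and the <code>violations</code> function are all hypothetical.</p>

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reuse:
    """Hypothetical record describing one republished copy of a CC work."""
    attributed_with_link: bool      # term 1: attribution plus a link back
    commercial_use: bool            # is the copy used commercially?
    license_is_noncommercial: bool  # term 2 applies only if this is True
    license_is_share_alike: bool    # term 3 applies only if this is True
    carries_same_license: bool      # does the copy express the same license?
    states_cc_license: bool         # term 4: "This work is licensed under..."
    name_removal_requested: bool    # term 5: has the licensor asked?
    name_removed: bool              # term 5: was the request honored?


def violations(reuse: Reuse) -> List[str]:
    """Return the terms from the list above that this reuse fails."""
    found = []
    if not reuse.attributed_with_link:
        found.append("missing attribution or link back")
    if reuse.license_is_noncommercial and reuse.commercial_use:
        found.append("commercial use of a non-commercial work")
    if reuse.license_is_share_alike and not reuse.carries_same_license:
        found.append("share-alike license not carried over")
    if not reuse.states_cc_license:
        found.append("no statement that the work is CC licensed")
    if reuse.name_removal_requested and not reuse.name_removed:
        found.append("name not removed on request")
    return found


if __name__ == "__main__":
    # An assumed ad-supported scraper copying a non-commercial work
    # with no credit and no license notice.
    scraper = Reuse(False, True, True, False, False, False, False, False)
    for problem in violations(scraper):
        print(problem)
```

<p>A reuse is only valid when the returned list is empty, which is why a single missed term is enough to put a scraper outside the license.</p>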
<p>The problem with all of this is that it is almost impossible for an automated scraping system to comply with all of these elements. </p>
<p>Though some spam bloggers do attribute and link back, most do not. All spam blogging, at least in theory, is a violation of the non-commercial license and, since no spam blogs I have seen carry over CC information, they are in violation of the attribution and share-alike attributes of the Creative Commons License.</p>
<p>Even if a scraper manages to comply with the first four mechanisms above, it is unlikely that, when asked, they would remove the name from any work they reused. Spammers, seeking to automate their operations, are unlikely to edit their spam blogs by hand to appease one copyright holder.</p>
<p>The result is that virtually all automated scraping and spam blogging is a violation of the Creative Commons License, regardless of what license is used.</p>
<p><strong>Technicalities and Human Error</strong></p>
<p>Some of these attributes, however, are relatively unknown. Though most people understand what is and is not acceptable with the various CC licenses, many of the nuances of using CC licenses, such as the fourth mechanism, are little known or followed, even by humans seeking to play fair.</p>
<p>However, most copyright holders, often in the dark about the requirements themselves, do not hold human copiers to these standards. So long as they get the attribution and reuse that they envision, they typically do not raise any alarms if there isn&#8217;t a &#8220;This work is licensed under&#8230;&#8221; statement in the reused content.</p>
<p>The question is whether or not it is fair to hold scrapers to a higher standard than we generally hold other people. While it certainly is the right of the copyright holder to determine which misuses of their work they follow up on and deal with, many will, likely, feel uneasy about using largely unenforced technicalities against spam bloggers.</p>
<p>But even those who feel uneasy about enforcing those elements of the Creative Commons license may still benefit from applying one, especially to their feed. With some very difficult questions about copyright and RSS feeds unanswered, having a defined license on your feed could become critical.</p>
<p><strong>Implied Licenses and RSS</strong></p>
<p>Though most attorneys I know and have spoken with feel that <a href="http://www.plagiarismtoday.com/2007/01/29/twil-discusses-implied-licenses-on-rss-feeds/">there is no implied license to scrape and republish RSS feeds</a>, the question has not yet come before a court and the outcome, as with all cases pushing new territory, is unpredictable at best.</p>
<p>However, if it is determined that RSS scraping and republishing is legal and that there is such an implied license with posting an RSS feed, attorney Denise Howell feels that any implied license can be <a href="http://betweenlawyers.corante.com/archives/2006/01/21/rss_and_copyright_the_no_example.php">overwritten by a defined one</a>, such as a Creative Commons License. </p>
<p>This makes sense considering that an implied license is one <a href="http://www.bitlaw.com/copyright/license.html#implied">designed to operate when no actual agreement</a> exists between the parties. If a specific license is posted, it would override the implied license.</p>
<p>We see this already on the Web. By posting a Web page to the Internet, the courts have found that there is an implied license for it to be indexed and cached by the search engines. However, once you state your intention for the page to not be used in such a manner, either through meta tags, robots.txt or manual opt-out, the implied license is dropped and the search engines, legitimate ones at least, have to comply with your requests.</p>
<p>The Creative Commons Organization is working on means of <a href="http://wiki.creativecommons.org/Syndication">integrating CC licenses into RSS feeds</a>. Hopefully this issue will garner more attention as the legal issues mount and a more final draft can be fleshed out.</p>
<p><strong>Conclusions</strong></p>
<p>The bottom line is that Creative Commons does not encourage or permit blind RSS scraping and spam blogging. Though it might be useful for legitimate aggregation, Creative Commons provides a great deal of protection against scraping, much more than previously thought.</p>
<p>Whether or not these mechanisms prove useful in fighting scraping remains to be seen. However, there is no longer a reason to hold back on a DMCA notice or a copyright complaint just because your commercial CC license seems to permit the use. Unless the scraper followed all of the requirements above, the use is still invalid.</p>
<p>Hopefully this will encourage the wider use of CC licenses, specifically the use of more liberal ones. I myself have removed the non-commercial requirement from my CC license as, like many others, my primary concern was commercial use by scrapers.</p>
<p>In the end, this is just another example of how Creative Commons, when used correctly, can work well for everyone and, in many cases, is good copyright policy. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.plagiarismtoday.com/2007/06/05/using-creative-commons-to-stop-scraping/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
	</channel>
</rss>

