Cloaking to Stop Scraping

Cloaking is generally thought of as a black hat SEO tactic that involves tricking a search engine into believing that a page contains one thing when, in fact, visitors will see something else, often completely unrelated.

However, cloaking is actually much broader than that. Though the black hat use is the best known, cloaking can also be used to offer different content to different browsers to ensure compatibility, serve differently sized images to different screen resolutions or even display content in different languages based on country. Any time a page automatically displays different content to one person or group of people than it does to another, that is considered cloaking.
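For instance, a site might pick a language based on the browser's Accept-Language header. The snippet below is a minimal PHP sketch of that benign use; the two include files are hypothetical placeholders:

<?php
// Benign cloaking: show a French page to browsers that prefer French,
// and an English page to everyone else. Both file names are placeholders.
$langs = isset($_SERVER['HTTP_ACCEPT_LANGUAGE']) ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : '';
if (strpos($langs, 'fr') === 0) {
    include 'page-fr.html'; // French-speaking visitors
} else {
    include 'page-en.html'; // everyone else
}
?>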

Recently, however, white hats found another positive use for cloaking: the ability to stop scraping by serving different content to a scraper than to the rest of the world. This has proved detrimental to one splogger and has earned one hacker his fifteen minutes of fame.

Best of all, the hacker in question is showing the world how to repeat his trick, including offering the code to enable any Wordpress user to fight back in much the same way.

The Ballad of RSnake

RSnake is a blogger at ha.ckers.org, an Internet security site that has been in operation since May. Last week, he discovered that a scraper was stealing his content but, rather than filing a DMCA notice or even contacting the scraper's registrar, he decided to research his plagiarist and collected a great deal of personal information about him. Then, using a bit of code, he modified his RSS feed to return a different page when the scraper returned, causing the scraper to pull in a lengthy paragraph of personal information, including name, address and more.

The plagiarist caught the change quickly and shut down the offending site while offering an apology for his misdeeds.

The response to RSnake's technique and the resulting shutdown has been overwhelmingly positive. While many have been worried about the posting of such personal information (RSnake has since removed the information from his site), most have agreed that the idea of "cloaking" an RSS feed to thwart spammers is a very exciting one.

This has prompted RSnake, as well as others, to offer up code to enable others to do the same with their own blogs.

Needless to say, this could grow to be a powerful tool to help many, especially those with their own servers and blogging software, combat RSS scraping.

How it Was Done (Wordpress)

The first step to cloaking content from a scraper is finding the IP address of the server doing the scraping. That can be tricky, especially if one isn't comfortable with networking tools and terminology, but it is usually just a matter of looking at the server that the scraped content appears on. Since most scrapers run their software on the same server that they publish their spam blogs on, the IP that is pulling the content is likely to be the same or very close to it. DomainTools.com can help you translate a site address into an IP address, greatly speeding up the process.

Even with that information, you will likely need to search your server logs, which are available with most paid hosting accounts, to find the IP address of the person scraping your feed. Just look for an IP address that is close or identical to the server itself and check whether the request times roughly match when new posts appeared on the plagiarist's site.
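If you would rather not comb through the raw log by hand, a short PHP script can tally feed requests by IP address. This is a minimal sketch, assuming an Apache-style access log where the IP is the first field on each line; the log path and feed path below are placeholders you will need to adjust for your own server:

<?php
// Tally requests for the feed, grouped by requesting IP address.
// Both paths below are hypothetical; change them to match your server.
$logFile  = '/var/log/apache2/access.log';
$feedPath = '/wp-rss2.php';
$hits = array();
foreach (file($logFile) as $line) {
    if (strpos($line, $feedPath) !== false) {
        $ip = strtok($line, ' '); // first field in common/combined log formats
        $hits[$ip] = isset($hits[$ip]) ? $hits[$ip] + 1 : 1;
    }
}
arsort($hits);                            // most frequent requesters first
print_r(array_slice($hits, 0, 10, true)); // top ten candidates
?>

An IP that requests the feed far more often than any human reader would, especially on a regular schedule, is a good candidate to check against the suspect server's address.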

Once you are relatively certain you have found your plagiarist, all you need to do is insert the following code into your Wordpress wp-rss2.php template (Note: this method works only in Wordpress; we'll discuss other systems in a moment).

Special thanks to RSnake for sharing this code with all of us! 

First, look for the following lines:

<?php the_category_rss() ?>
<guid isPermaLink="false"><?php the_guid(); ?></guid> 

Then, after that, add the following lines, replacing the Xs with the IP address of the scraper, the Ys with a fake description and the Zs with fake content:

<?php if ($_SERVER['REMOTE_ADDR'] == "XXX.XXX.XXX.XXX") : ?>
<description><![CDATA[YYYYY]]></description>
<content:encoded><![CDATA[ZZZZZ]]></content:encoded>
<?php else : ?>

Finally, nine more lines down from that, close the conditional with the following tag:

<?php endif; ?>
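Put together, the edited section of the template would look roughly like the sketch below. Note that the two lines in the else branch stand in for the description and content:encoded lines your copy of wp-rss2.php already contains; the exact template code varies between Wordpress versions:

<?php the_category_rss() ?>
<guid isPermaLink="false"><?php the_guid(); ?></guid>
<?php if ($_SERVER['REMOTE_ADDR'] == "XXX.XXX.XXX.XXX") : ?>
<!-- The scraper's IP address receives the decoy entry... -->
<description><![CDATA[YYYYY]]></description>
<content:encoded><![CDATA[ZZZZZ]]></content:encoded>
<?php else : ?>
<!-- ...everyone else receives the normal feed output. -->
<description><![CDATA[<?php the_excerpt_rss() ?>]]></description>
<content:encoded><![CDATA[<?php the_content() ?>]]></content:encoded>
<?php endif; ?>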

If done correctly, it will serve the scraper a fake RSS feed with whatever content you specified. To test it out, place your own IP address in the field and visit your feed. If you get the fake content, everything is working according to plan.

As far as the fake content goes, a blank feed would put the least strain on bandwidth and server resources, but it can contain whatever you desire, including a general broadcast saying something to the effect of, "If you are reading this, the site you are at is a scraper and is attempting to use my content illegally."

In that regard, one would be following in the footsteps of what visual artists have been doing for years to protect their images from being hotlinked and plagiarized. 

Other Systems 

While I am looking for similar code for other popular blogging platforms, such as MT, Textpattern, etc., one blogger, who, oddly enough, focuses on black hat SEO techniques, has devised a way to perform a similar cloak on any site hosted on a server with an editable .htaccess file (Note: this generally only includes paid hosting accounts).

Once one has discovered the IP address of the scraper and has created a page of fake content, perhaps a generated RSS feed, all they have to do is place these three lines in their .htaccess file:

RewriteEngine on
RewriteCond %{REMOTE_ADDR} ^XXX\.XXX\.XXX\.XXX$
RewriteRule ^(.*)$ https://newfeedurl.com/feed [R,L]

Once again, the Xs represent the IP address of the scraper (the backslashes simply escape the dots, so they match literally). However, this time, https://newfeedurl.com/feed represents the address of the fake content page or feed.
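If more than one scraper is pulling your feed, the same rule extends with Apache's [OR] flag. This is a sketch assuming two offending addresses, with the Ws standing in for a second scraper's IP:

# Redirect two known scraper IPs to the decoy feed (IPs are placeholders)
RewriteEngine on
RewriteCond %{REMOTE_ADDR} ^XXX\.XXX\.XXX\.XXX$ [OR]
RewriteCond %{REMOTE_ADDR} ^WWW\.WWW\.WWW\.WWW$
RewriteRule ^(.*)$ https://newfeedurl.com/feed [R,L]

The [OR] flag tells Apache that matching either condition is enough to trigger the redirect; without it, both conditions would have to match at once.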

Needless to say, this method is only for individuals that have access to their .htaccess file and are comfortable editing it. If you need information about .htaccess, you can find an excellent guide on it here.

Limitations

There are several obvious limitations to this method. First, these techniques can't yet be used with free accounts such as Blogger, Myspace or Xanga. Though all of these sites produce RSS feeds, none offer the template access or the server access required to make this kind of redirection work (Note: it may be possible to do this with a free Wordpress.com account; I do not know).

Second, they cannot be used by anyone taking advantage of Feedburner's service, at least not yet. Though Feedburner does an excellent job of protecting a feed by detecting and reporting "uncommon uses" of it, it does not offer a way to prevent specific individuals from accessing a feed. However, this may be a feature that arrives at a later date.

Also, these techniques are only for people who are comfortable with discovering a scraper's IP address and applying code to either a PHP file or a .htaccess file to create the redirection. While those who are familiar with the technology and have used it for some time will have no problem with this, those who are new to running a site or haven't dabbled in these areas might be intimidated.

Finally, there's nothing to stop the scraper from simply moving his operations to a new IP address or a new domain altogether. While the same is true for DMCA notices and other plagiarism cessation techniques, those usually result in the closure of the plagiarist's account, costing him money and time. With this, it is trivial for the scraper to move operations to another server he or she has already prepared.

Still, it is a powerful and immediate way to stop a scraper. If nothing else, it might be a good stopgap measure to prevent further abuse while waiting for other countermeasures to close the site. It can also be very effective in international cases and in situations where the host is extremely uncooperative.

Clearly, for those capable of using it, it is a tool to consider.

Conclusions

In the end, the effectiveness of this tool may be limited by one's technical prowess, how powerful one's setup is and the nature of the scrapers abusing the work. It may not be perfect for every person in every situation, but in the situations where it does work, it works almost perfectly.

If nothing else, it is important to be aware of this weapon and consider it as a potential tool for legitimate bloggers to protect their content. Not only is it strangely fitting to turn a black hat SEO technique against the people that practice it, but it is also powerful, immediate and surgical in nature. In short, it is effective, quick and harms no one other than the plagiarist.

In that regard, it is many times better than many of the currently available techniques and, while it isn't a replacement for more traditional routes, it can be a useful tool to stop and prevent RSS scraping.

It is an extra weapon in a war where our options are, for the most part, severely limited.  

[tags]Plagiarism, Content Theft, Scraping, Cloaking, Black Hat SEO, Splogging, Splogs[/tags] 
