Cloaking to Stop Scraping

Cloaking is generally thought of as a black hat SEO tactic that involves tricking the search engine into believing that a page has one thing on it when, in fact, visitors will see something else, often completely unrelated.

However, cloaking is actually much broader than that. Though the black hat use is the best known, it can also be used to offer different content to different browsers to ensure compatibility, different sized images to different screen resolutions or even content in different languages based on country. Any time a page automatically displays different content to one person or group of people than it does to another, that is considered cloaking.

White hats recently found another positive use for cloaking: stopping scraping by serving different content to a scraper than to the rest of the world. This has proved detrimental to one splogger and has earned one hacker his fifteen minutes of fame.

Best of all, the hacker in question is showing the world how to repeat his trick, and is even offering code that lets any WordPress user fight back in much the same way.

The Ballad of RSnake

RSnake is a blogger at ha.ckers.org, an Internet security site that has been in operation since May. Last week, he discovered that a scraper was stealing his content but, rather than filing a DMCA notice or even contacting the scraper's registrar, he decided to research his plagiarist and collected a great deal of personal information about him. Then, using a bit of coding, he modified his RSS feed to return a different page when the scraper returned, causing the scraper to pick up and republish a lengthy paragraph of personal information including name, address and more.

The plagiarist caught the change quickly and shut down the offending site while offering an apology for his misdeeds.

The response to RSnake's technique and the resulting shutdown has been overwhelmingly positive. While many have been worried about the posting of such personal information (RSnake has since removed the information from his site), most have agreed that the idea of "cloaking" an RSS feed to thwart spammers is very exciting.

This has prompted RSnake, as well as others, to offer up code to enable others to do the same with their own blog.

Needless to say, this could grow to be a powerful tool to help many, especially those with their own servers and blogging software, combat RSS scraping.

How it Was Done (WordPress)

The first step to cloaking content from a scraper is finding the IP address of the server doing the scraping. That can be tricky to do, especially if one isn't very comfortable with networking tools and terminology, but it is usually just a matter of looking at the server that the scraped content appears on. Since most scrapers run their software on the same server that they publish their spam blogs on, the IP that is pulling the content is likely to be the same or very close to it. DomainTools.com can help you translate a site address into an IP address, greatly speeding up the process.

Even with that information, you will likely need to search your server logs, which are available with most paid hosting accounts, to find the IP address of the person that is scraping your feed. Just look for an IP address that is close or identical to that of the server itself and check whether the times of its requests roughly match when new posts appeared on the plagiarist's site.

Once you are relatively certain you have your plagiarist, all you need to do is insert the following code into your WordPress wp-rss2.php template. (Note: This method works only in WordPress; we'll discuss other systems in a moment.)
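For those who would rather not read the logs by eye, the search described above can be sketched in a few lines of Python. This is only a rough illustration, not part of RSnake's method: it assumes the common "combined" access log format and a feed served under /feed, and the log lines and IP addresses below are made up for the example.

```python
from collections import Counter


def feed_hits_by_ip(log_lines, feed_path="/feed"):
    """Count requests for the feed per client IP (combined log format assumed)."""
    hits = Counter()
    for line in log_lines:
        parts = line.split('"')
        if len(parts) < 2:
            continue  # no quoted request section, skip the line
        ip = line.split(" ", 1)[0]
        request = parts[1].split()  # e.g. ['GET', '/feed/', 'HTTP/1.1']
        if len(request) >= 2 and request[1].startswith(feed_path):
            hits[ip] += 1
    return hits


def near(ip, server_ip, octets=3):
    """True if ip shares its first `octets` octets with server_ip."""
    return ip.split(".")[:octets] == server_ip.split(".")[:octets]


# Hypothetical log lines and addresses, for illustration only.
sample_log = [
    '192.0.2.44 - - [10/Oct/2006:13:55:36 -0700] "GET /feed/ HTTP/1.1" 200 5120 "-" "FeedFetcher"',
    '192.0.2.44 - - [10/Oct/2006:14:55:40 -0700] "GET /feed/ HTTP/1.1" 200 5120 "-" "FeedFetcher"',
    '198.51.100.7 - - [10/Oct/2006:14:01:02 -0700] "GET /feed/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [10/Oct/2006:14:01:09 -0700] "GET /about/ HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]
splog_server_ip = "192.0.2.40"  # the IP the spam blog resolves to (made up)

hits = feed_hits_by_ip(sample_log)
suspects = {ip: n for ip, n in hits.items() if near(ip, splog_server_ip)}
print(suspects)
```

The result is only a list of candidates. You would still want to compare the request times against when your posts showed up on the plagiarist's site before acting on it.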

Special thanks to RSnake for sharing this code with all of us! 

First, look for the following lines:


 

Then, after that, add the following lines, replacing the Xs with the IP address of the scraper, the Ys with the fake descriptions and the Zs with fake content:



Finally, nine more lines down from that, close the "if" statement with the following tag:

If done correctly, it should forward the scraper to a fake RSS feed with whatever content you specified. To test it out, place your own IP address in the field and visit your feed. If you get the fake content, everything is working according to plan.
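The logic behind this kind of check is simple enough to sketch outside of WordPress. The Python below is not RSnake's code, just a minimal illustration of the same idea: compare the visitor's address to a block list and pick which feed to serve. The addresses are placeholders, and the optional address range is my own assumption rather than something the WordPress template does.

```python
import ipaddress

# Hypothetical block list: one single address and one whole range.
BLOCKED = [
    ipaddress.ip_network("192.0.2.44/32"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_scraper(remote_addr, blocked=BLOCKED):
    """Return True if remote_addr falls inside any blocked network."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # malformed address: fall back to the real feed
    return any(addr in net for net in blocked)


def choose_feed(remote_addr, real_feed, decoy_feed):
    """Serve the decoy to known scrapers and the real feed to everyone else."""
    return decoy_feed if is_scraper(remote_addr) else real_feed


print(choose_feed("192.0.2.44", "real feed", "decoy feed"))
print(choose_feed("203.0.113.5", "real feed", "decoy feed"))
```

Testing works the same way as described above: temporarily add your own address to the list and confirm that you receive the decoy.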

As far as the fake content goes, a blank feed would produce the least amount of strain on bandwidth and server resources but it can contain whatever you desire, including a general broadcast saying something to the effect of, "If you are reading this, the site you are at is a scraper and is attempting to use my content illegally."
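As a rough sketch of what such a broadcast feed might look like, the Python below builds a minimal RSS 2.0 document containing a single notice item. The blog title and URL are placeholders, and this is just one way to generate the file; a hand-written XML file would do equally well.

```python
import xml.etree.ElementTree as ET


def build_decoy_feed(site_title, site_url, notice):
    """Build a minimal RSS 2.0 feed whose only item is the notice text."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = notice
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = "Notice"
    ET.SubElement(item, "link").text = site_url
    ET.SubElement(item, "description").text = notice
    return ET.tostring(rss, encoding="unicode")


# Placeholder title and URL; the notice is the one suggested above.
decoy = build_decoy_feed(
    "Example Blog",
    "http://example.com/",
    "If you are reading this, the site you are at is a scraper "
    "and is attempting to use my content illegally.",
)
print(decoy)
```

Keeping the feed to a single short item keeps the bandwidth cost close to that of a blank feed.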

In that regard, one would be following in the footsteps of what visual artists have been doing for years to protect their images from being hotlinked and plagiarized. 

Other Systems 

While I am looking for similar code to work in other popular blogging platforms, such as MT, Textpattern, etc., one blogger, who, oddly enough, focuses on black hat SEO techniques, has devised a way to perform a similar cloak on any site that is hosted on a server that has an editable .htaccess file (Note: This generally only includes paid hosting accounts).

Once one has discovered the IP address of their scraper and has created a page filled with fake content, perhaps a generated RSS feed filled with fake content, all they have to do is place these three lines in their .htaccess file:

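RewriteEngine on
RewriteCond %{REMOTE_ADDR} ^XXX\.XXX\.XXX\.XXX$
RewriteRule ^(.*)$ http://newfeedurl.com/feed [R=302,L]

Once again, the Xs represent the IP address of the scraper; the dots are escaped and the pattern is anchored at both ends so that it matches only that exact address. This time, http://newfeedurl.com/feed represents the address of the fake content page or feed, and the [R=302,L] flags make the redirect explicit and stop further rule processing. If the fake feed is hosted on the same site, exclude its path from the rule, or the scraper will be redirected in a loop.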
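If you end up dealing with more than one scraper, writing these rules by hand gets tedious. The Python below is a small sketch, my own addition rather than part of the original technique, that generates the rules for a list of addresses. It escapes the dots in each address so they match literally and joins multiple conditions with Apache's [OR] flag. The addresses are placeholders.

```python
import re


def htaccess_rules(scraper_ips, decoy_url):
    """Return .htaccess lines that redirect the given IPs to decoy_url."""
    if not scraper_ips:
        # With no conditions, the rule would redirect every visitor.
        raise ValueError("at least one scraper IP is required")
    lines = ["RewriteEngine on"]
    for i, ip in enumerate(scraper_ips):
        # Every condition except the last is joined with [OR].
        flag = " [OR]" if i < len(scraper_ips) - 1 else ""
        lines.append("RewriteCond %{REMOTE_ADDR} ^" + re.escape(ip) + "$" + flag)
    lines.append("RewriteRule ^(.*)$ " + decoy_url + " [R=302,L]")
    return "\n".join(lines)


# Placeholder addresses, for illustration only.
rules = htaccess_rules(["192.0.2.44", "198.51.100.7"], "http://newfeedurl.com/feed")
print(rules)
```

Paste the printed lines into your .htaccess file, and test with your own IP address before relying on it.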

Needless to say, this method is only for individuals that have access to their .htaccess file and are comfortable editing it. If you need information about .htaccess, you can find an excellent guide on it here.

Limitations

There are several obvious limitations to this method. First, it can't be used, yet, by anyone on a free account such as Blogger, Myspace or Xanga. Though all of these sites produce RSS feeds, none offer the template access or the server access required to make this kind of redirection work. (Note: It may be possible to do this with a free WordPress.com account; I do not know.)

Second, it cannot be used by anyone taking advantage of Feedburner's service, at least not yet. Though Feedburner does an excellent job of protecting a feed by detecting and reporting "uncommon uses" of it, Feedburner does not offer a way to prevent certain individuals from accessing it. However, this may be a feature that will come out at a later date.

Also, these techniques are only for people that are comfortable with discovering a scraper's IP address and applying code to either a PHP or a .htaccess file in order to create the redirection. While those who are familiar with the technology and have used it for some time will have no problem with this, those who are new to running a site or haven't dabbled in these areas might be intimidated.

Finally, there's nothing to stop the scraper from simply moving operations to a new IP address or a new domain altogether. While the same is true for DMCA notices and other plagiarism cessation techniques, those usually result in the closure of the plagiarist's account, costing the plagiarist money and time. With this, it is trivial for the scraper to move operations to another server that he or she has already prepared.

Still, it is a powerful and immediate way to stop a scraper. If nothing else, it might be a good stopgap measure to prevent further abuse while waiting for other countermeasures to actually close the site. It can also be very effective in international cases and situations where the host is extremely uncooperative.

Clearly, for those capable of using it, it is a tool to consider.

Conclusions

In the end, the effectiveness of this tool may be limited by one's technical prowess, how powerful one's setup is and the nature of the scrapers that abuse one's work. It may not be perfect for every person in every situation, but in the situations where it does work, it works almost perfectly.

If nothing else, it is important to be aware of this weapon and consider it as a potential tool for legitimate bloggers to protect their content. Not only is it strangely fitting to turn a black hat SEO technique against the people that practice it, but it is also powerful, immediate and surgical in nature. In short, it is effective, quick and harms no one other than the plagiarist.

In that regard, it is many times better than many of the currently available techniques and, while it isn't a replacement for more traditional routes, it can be a useful tool to stop and prevent RSS scraping.

It is an extra weapon in a war where our options are, for the most part, severely limited.  

15 COMMENTS

  1. I also found that someone was ripping info from my site, and did something similar with .htaccess. I made up a fake RSS feed which had 1000 entries of nonsense. Then I redirected the spammer's IP to go to that fake RSS. You can read about it at my personal blog.

    While it's easy to do for one offender, it might get messy when so many splogs spring up daily.

  3. Excellent writeup! Plenty of easy ways to get around ip-specific stuff like this, but at least it’s likely to catch a bunch of unsuspecting sploggers that employ the methods used here.

    I’ve been meaning to ask, what are your thoughts on microformats, particularly hreview, making it so simple to reproduce and misuse (intentionally or not) another’s data along with a laundry list of other aggregating and indexing issues for hreviews?

    If you have no idea what I’m talking about don’t worry (yet). I’ve done some writeups about some of the issues on my blog if you care to take a peek. I think it would interest you.

    J?

  5. However there’s weakness in this method. How if I am currently on the same proxy server as the spammers? Wouldn’t that mean that I won’t be able to receive the correct RSS?

  6. This is a neat concept. The ones that really floor me are these clowns who create a blog based on a google search term where they grab any post hit by the given term. Even in quantity I can't imagine how this gets them any reasonable revenue. The sites are horrid looking and often badly formatted, usually with just Google Adsense. Where would their traffic come from?

    Plus you have to wonder how you'd try to get them taken down since they usually link back to the original article and don't always use the whole thing. I've never worried too much about it but it sure makes your incoming links look weird.

  7. Hung,

    Thanks for the link! I greatly appreciate it. It’s a good read and anyone that is interested in this article needs to take a look at it.

    Somber One,

    You don’t put the code directly into your feed, but rather, in the template for your feed. It only works with WordPress and, apparently, not with free wordpress.com accounts. If you have your own install of WordPress, which I don’t believe you have on your site, you can edit the templates.

    I’m working on getting other code for different formats.

    Ja,

    Yes, there are plenty of ways to get around it, fortunately sploggers are a “set and forget” crowd that likely won’t bother. It’s just easier to move on.

    I can honestly say I don’t know much about Microformats. I’m going to take some time to look at it today and I’ll see about doing a writeup either Friday or Monday. I’ll look at your site a little bit later.

    Thanks for the tip!

    Merideth,

    Thanks for the information, it’s disappointing, but not unexpected. I’ll edit the article in a second to reflect that.

    Oskar,

    There are weaknesses, but it is very unlikely that you’ll be on the same proxy as the splogger. The reason is that splogger software usually runs on the server itself, the same as the site. Odds are you won’t have your home connection on the same proxy as a Web site.

    There are plenty of weaknesses, but I don’t see this as a common problem. Let me know if I’m wrong though, there might be something I’m not seeing.

  10. Locating a scraper by its server IP address will be quite tricky when it is being run from shared hosting. With more than one user on the same IP address, at best you will find that the IP belongs to the hosting provider, and there is no point publishing the "personal details" of a hosting provider.

    In my case, running a WPMU site, I got struck by a scraping attack from a Russian IP. I concluded it was scraping based on the amount of data they downloaded from my site in a short span of time (less than 12 hours). The data downloaded was around 500 MB, close to 12,000 HTTP requests. That made my site almost inaccessible during that period. When I tried reverse DNS, I got little information other than that the IP belongs to a Russian ISP.

    So my best remedy was just to block that IP range (95.108). Anyway, blogs on my site are mostly in English, so I doubt a Russian will be reading them or registering a blog.

    I wonder how google (blogger.com) is dealing with this kind of problem.
