Using .htaccess to Stop Content Theft

Having control over your own server can be a very powerful thing. It enables you to control who can access your site, how they visit it and what they can see.

Generally, however, that power is best left unused. For the most part, restricting people’s access to your site is a bad move. Though you can use your powers to carve out a members-only area or prevent others from accessing administrative areas of the site, turning people away from the door is usually unwise.

Still, there are some people that you want to keep out. RSS scrapers and image hotlinkers, for example, offer nothing to your site but instead only steal your content, your bandwidth and your other resources. If you can prevent them from accessing your site in the first place, without impacting other users, it is probably in your best interest to do so.

Fortunately, with Apache’s .htaccess file, it is possible to do all of those things and more. All you have to do is understand a few basics and find the code you need.

A Quick Primer

According to the Apache Software Foundation, .htaccess is a distributed configuration file that provides “a way to make configuration changes on a per-directory basis”. It is most commonly used when a Webmaster has access to the server, but not the core configuration files for that server. This is typical of most shared hosting environments.

When editing an .htaccess file, there are three important things to remember:

  1. .htaccess is the name of the file: In short, htaccess is the extension and there is no file name. This can make editing the file difficult on some computers, but it is important that the convention be followed. If needed, name the file something else and rename it after uploading it to your server.
  2. It is an ASCII file: .htaccess is a plain text file and should only be edited in a text editor such as Notepad.
  3. It only works with Apache: Though other servers, such as Microsoft’s IIS, offer similar features, .htaccess itself is only for Apache-based servers. If you are unsure of what kind of server you have, check with your hosting provider.

Finally, it is important, when working with .htaccess, to back up well and be careful with your edits. A poorly-constructed .htaccess file can render your site useless.

But despite these warnings, .htaccess files are, generally, very easy to edit and manipulate. Furthermore, there is a lot of very good free code ready for you to copy, paste and manipulate to fit your needs.

Stop Image/File Hotlinking

One of the easiest and most basic tasks that can be performed with .htaccess is stopping image/file hotlinking. This is the practice by which other sites link directly to your files, having them display on or download from their pages. This amounts not only to content theft, and often plagiarism, but also to bandwidth theft, as your server spends resources to serve the file every time someone on their site calls for it.

According to Zann Marketing, the process is very simple. All you have to do is navigate to your images folder and either create a new .htaccess file or add the following code to your existing one:

RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://yoursite\.com.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://www\.yoursite\.com.*$ [NC]
RewriteRule .*\.(gif|jpg|png)$ - [F]

The first line turns the rewrite engine on; the second checks whether the referrer is blank; the third and fourth check that the request is not coming from your own site; and the fifth tells the server to deny the request for the listed file types when all of those conditions are met, that is, when the referrer is neither blank nor from your own site.

You can easily modify this code in several different ways, including:

  1. Add New Domains: You can add new domains and sites to allow hotlinking from. The original example from Zann Marketing includes the IP address for Google Images, for example. You can include other search engines as well.
  2. Add New File Types: By editing the last line, you can modify your rules to include any kind of file necessary including movie files, documents and anything else you wish to have protected from hotlinking.
  3. Disable Access to Blank Referrers: By removing the second line, you can block requests from browsers and tools that send no referrer at all. Though some scrapers and black hat spiders do this, so do many legitimate visitors in a bid to protect their privacy.
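Putting the first two modifications together, an extended ruleset might look something like the sketch below. The domain names are placeholders: yoursite.com stands in for your own domain, and partnersite.com is a hypothetical second site you want to allow hotlinking from.

```apache
RewriteEngine on
# Allow requests with a blank referrer (remove this line to block them)
RewriteCond %{HTTP_REFERER} !^$
# Allow your own domain, with or without www
RewriteCond %{HTTP_REFERER} !^http://(www\.)?yoursite\.com [NC]
# Allow a hypothetical additional trusted domain
RewriteCond %{HTTP_REFERER} !^http://(www\.)?partnersite\.com [NC]
# Deny images, movies and documents requested from anywhere else
RewriteRule \.(gif|jpg|png|avi|mov|pdf|doc)$ - [F]
```

Each RewriteCond must fail to match for the request to be allowed through, so adding a domain means adding another exclusion line, and adding a file type means extending the parenthesized list in the final rule.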

Though this method will not stop people from saving your images to their hard drives and uploading them where they please, it can prevent people from stealing both your image and your bandwidth at the same time.

Also, on the original Zann Marketing page, there are examples for blocking just one hotlinker and for redirecting hotlinkers to another image, thus pulling the famous “switcheroo”.
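The redirect approach can be sketched roughly as follows, assuming a hypothetical substitute image at /angry.gif on your own server:

```apache
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?yoursite\.com [NC]
# Do not rewrite the substitute image itself, or the rule would loop
RewriteCond %{REQUEST_URI} !^/angry\.gif$
# Send hotlinkers a substitute image instead of a 403 error
RewriteRule \.(gif|jpg|png)$ http://yoursite.com/angry.gif [R,L]
```

The extra REQUEST_URI condition matters: without it, the redirected request for the substitute image would itself match the rule and redirect again, endlessly.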

Finally, if the process of editing the code seems too daunting, you can also use HTML Basix’s .htaccess code generator to create an .htaccess code set for you to copy and paste into your file.

Blocking RSS Scraping

Equally easy, or in some cases even easier, than blocking hotlinking is blocking RSS scrapers. All you need to do is determine the IP address of the RSS scraper. (Note: You can use domains if you wish. However, since not all scraping software is located on the domain itself, IP addresses are more reliable.)

The easiest way to determine the IP address of a scraper is to use the Copyfeed plugin and have it place the IP address in the scraped content. This not only eliminates the need to translate domain names into IP addresses, but also works in cases where the scraping software is located on another server or computer.

However, if that fails or is not an option and the scraped site is hosted on its own domain, you can simply use the IP address for the server itself. To determine the IP address for a domain, simply enter it into a site like Domain Tools and let it get the IP for you. It only takes a few seconds.

If the scraper is using a free service such as Blogspot, you will likely have to look into your server logs and attempt to find traffic on the feed that coincides with when the posts go up on the scraper site. It is a risky task to undertake, as you can accidentally block legitimate users, and it can be very time-consuming on larger sites, but it is the only option in some cases if you wish to use blocking techniques.

No matter what method you use, once you have the IP address, all that is required, according to JavascriptKit, is the following code in the .htaccess file of your feed’s directory:

order allow,deny
deny from xxx.xxx.xxx.xxx
allow from all

Editing the code is easy: all you have to do is replace the Xs with the IP address of the scraper. You can add more deny lines as new scrapers emerge, and you can also use wildcards by leaving off numbers. For example, 123.123.123. would block all IP addresses that start with 123.123.123. This can be useful when a scraper’s IP changes, but only within a certain range.
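A multi-scraper block list might look like the sketch below; the addresses here are made up for illustration.

```apache
Order Allow,Deny
# Block a single scraper by its full IP address (hypothetical)
Deny from 123.45.67.89
# Block a second scraper that hops within a range, using a wildcard
Deny from 123.123.123.
Allow from all
```

With "Order Allow,Deny", the Deny directives override the blanket Allow, so everyone gets in except the listed addresses.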

It is important to note that this code will block ALL access to your site for that IP address. However, there is very little reason to allow a scraper onto your site as, most likely, they are only accessing the feed anyway.

Also, if you want to redirect scrapers to a fake feed, you can use the method discussed by Hung Truong, which often generates very humorous results.

Finally, once again, if you are uneasy with editing the files yourself, HTML Basix offers another .htaccess generator, this one to block users. It is also useful to stop RSS scraping.

The bottom line is that, once you obtain the IP address of the scraper, it is trivial to block them using .htaccess. All you need is a little bit of understanding and some freely-available code.

Conclusions

Though .htaccess editing can seem very intimidating to a novice, it is actually very easy to do. With the proper tools and a few fundamentals, anyone can manipulate their .htaccess files and use it to their advantage.

Though these manipulations won’t do anything to stop human plagiarism, they can stop some of the more common types of plagiarism before they happen, all without impacting legitimate users at all. It makes sense, if possible, to use these methods to your advantage.

However, it is important to note that you are unlikely to find a free host that allows manipulation of .htaccess files. This is, predominantly, a feature of paid hosting companies. Also, it will not work if your images and RSS feeds are on another server, such as Flickr or FeedBurner.

But if you’ve paid for your hosting, it makes sense to use the tools that come with that kind of an upgrade. One of those is the ability to protect your content at the server level.

It is a great power and one that is sorely underused on the Web.
