Using .htaccess to Stop Content Theft

Having control over your own server can be a very powerful thing. It enables you to control who can access your site, how they visit it and what they can see.

Generally, however, that power is best left unused. For the most part, restricting people’s access to your site is a bad move. Though you can use your powers to carve out a members-only area or prevent others from accessing administrative areas of the site, turning people away from the door is usually unwise.

Still, there are some people that you want to keep out. RSS scrapers and image hotlinkers, for example, offer nothing to your site but instead only steal your content, your bandwidth and your other resources. If you can prevent them from accessing your site in the first place, without impacting other users, it is probably in your best interest to do so.

Fortunately, with Apache’s .htaccess file, it is possible to do all of those things and more. All one has to do is understand a few basics and get the code that they need.

A Quick Primer

According to the Apache Software Foundation, .htaccess is a distributed configuration file that provides “a way to make configuration changes on a per-directory basis”. It is most commonly used when a Webmaster has access to the server, but not the core configuration files for that server. This is typical of most shared hosting environments.

When editing an .htaccess file, there are three important things to remember:

  1. .htaccess is the name of the file: In short, htaccess is the extension and there is no file name. This can make editing the file difficult on some computers, but it is important that the convention be followed. If needed, name the file something else and rename it after uploading it to your server.
  2. It is an ASCII file: .htaccess is a plain text file and should only be edited in a text editor such as Notepad.
  3. It only works with Apache: Though other servers, such as Microsoft’s IIS Server, offer similar features, .htaccess itself is only for Apache-based servers. If you are unsure of what kind of server you have, check with your hosting provider.

Finally, it is important, when working with .htaccess, to back up well and be careful with your edits. A poorly-constructed .htaccess file can render your site useless.

But despite these warnings, .htaccess files are, generally, very easy to edit and manipulate. Furthermore, there is a lot of very good free code ready for you to copy, paste and manipulate to fit your needs.

Stop Image/File Hotlinking

One of the easiest and most basic tasks that can be performed with .htaccess is stopping image/file hotlinking. This is the process by which other sites link directly to your files, either having them display or download directly from their site. This amounts not only to content theft, and often plagiarism, but also to bandwidth theft, as your server spends the resources to serve the file every time someone on their site calls for it.

According to Zann Marketing, the process is very simple. All one has to do is navigate to their images folder and either create a new .htaccess file or add the following code to their existing one:

RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://yoursite\.com(/|$) [NC]
RewriteCond %{HTTP_REFERER} !^https?://www\.yoursite\.com(/|$) [NC]
RewriteRule .*\.(gif|jpg|png)$ - [F,NC]

The first line tells the server to turn the Rewrite engine on. The second line checks that the referrer is not blank, so requests with no referrer at all are let through. The third and fourth lines check that the referrer is not your own site, with or without the www prefix. The fifth line instructs the server to refuse the request for the selected file types, with a 403 Forbidden response, when all of those conditions are met, that is, when the request carries a referrer from some other site.

You can easily modify this code in several different ways, including:

  1. Add New Domains: You can add new domains and sites to allow hotlinking from. The original example from Zann Marketing includes the IP address for Google Images, for example. You can include other search engines as well.
  2. Add New File Types: By editing the last line, you can modify your rules to include any kind of file necessary including movie files, documents and anything else you wish to have protected from hotlinking.
  3. Disable Access to Blank Referrers: By removing the second line, you can prevent access from browsers and tools that return a blank referrer. Though some scrapers and black hat spiders do this, so do many visitors in a bid to protect their privacy.
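
As a rough sketch of the first two modifications above, the rule set below lets one extra site hotlink and protects a few more file types. The domain partner-example.com and the added extensions are placeholders chosen for illustration, not part of the Zann Marketing example, so swap in your own.

```apache
RewriteEngine on
# Let requests with no referrer through (direct visits, privacy tools)
RewriteCond %{HTTP_REFERER} !^$
# Allow your own site, with or without www, over http or https
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?yoursite\.com(/|$) [NC]
# Allow one extra site; partner-example.com is a placeholder
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?partner-example\.com(/|$) [NC]
# Refuse everything else for the listed file types
RewriteRule \.(gif|jpe?g|png|pdf|mp4|zip)$ - [F,NC]
```

Each extra RewriteCond line adds one more permitted referrer, and the list of extensions in the last line controls which files are protected.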

Though this method will not stop people from saving your images to their hard drive and uploading them where they please, it can prevent people from stealing both your image and your bandwidth at the same time.

Also, on the original Zann Marketing page, there are examples for blocking just one hotlinker and for redirecting hotlinkers to another image, thus pulling the famous “switcheroo”.
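
For readers who want a feel for how the switcheroo works, here is a minimal sketch of one common way to do it. It is not the Zann Marketing code, and the replacement image name hotlink-notice.png and its location are assumptions made for the example.

```apache
RewriteEngine on
# Let requests with no referrer through
RewriteCond %{HTTP_REFERER} !^$
# Allow your own site, with or without www
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?yoursite\.com(/|$) [NC]
# Skip the replacement image itself so the redirect cannot loop
RewriteCond %{REQUEST_URI} !/hotlink-notice\.png$ [NC]
# Send every other hotlinked image to the replacement (hypothetical path)
RewriteRule \.(gif|jpe?g|png)$ https://www.yoursite.com/images/hotlink-notice.png [R=302,L,NC]
```

Keep the replacement image small. Your server still answers every hotlinked request, so a large substitute would cost you more bandwidth than simply refusing the request.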

Finally, if the process of editing the code seems too daunting, you can also use HTML Basix’s .htaccess code generator to create an .htaccess code set for you to copy and paste into your file.

Blocking RSS Scraping

Blocking RSS scrapers is as easy as blocking hotlinking, and in some cases even easier. All you need to do is determine the IP address of the RSS scraper. (Note: You can use domains if you wish. However, since not all scraping software is located on the domain itself, IP addresses are more reliable.)

The easiest way to determine the IP address of a scraper is to use the Copyfeed plugin and have it place the IP address in the scraped content. This not only eliminates the need to translate domain names into IP addresses, but also works in cases where the scraping software is located on another server or computer.

However, if that fails or is not an option and the scraped site is hosted on its own domain, you can simply use the IP address for the server itself. To determine the IP address for a domain, simply enter it into a site like Domain Tools and let it get the IP for you. It only takes a few seconds.

If the scraper is using a free service such as Blogspot, you will likely have to look into your server logs and attempt to find traffic on the feed that lines up with when the posts go up on the scraper site. It is a risky task to undertake, as you can accidentally block legitimate users, and it can be very time-consuming on larger sites, but it is the only option in some cases if you wish to use blocking techniques.

No matter what method you use, once you have the IP address, all that is required, according to JavascriptKit, is the following code in the .htaccess file of your feed’s directory:

order allow,deny
deny from xxx.xxx.xxx.xxx
allow from all

Editing the code is easy: all you have to do is replace the Xs with the IP address of the scraper. You can add more lines as new scrapers emerge, and you can also use wildcards by leaving off numbers. For example, 123.123.123. would block all IP addresses that start with 123.123.123. This can be useful if a scraper has an IP that changes, but only within a certain range.
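
To make that concrete, here is a small sketch that blocks one single address and one whole range. Both addresses are placeholders taken from the ranges reserved for documentation, not real scrapers.

```apache
Order allow,deny
Allow from all
# Block one specific address (placeholder)
Deny from 192.0.2.15
# Block every address that starts with 198.51.100 (placeholder range)
Deny from 198.51.100
```

Everyone else is still let in by the Allow line. Note that Order, Allow and Deny are the older Apache 2.2 style of access control. On Apache 2.4 the same idea is normally written with Require directives, for example Require all granted together with Require not ip 192.0.2.15 inside a RequireAll block, so check with your host if the rules above have no effect.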

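It is important to note that this code will block ALL access for that IP address to the directory holding the .htaccess file and everything beneath it, which means your entire site if the file sits in your site's root. However, there is very little reason to allow a scraper onto your site as, most likely, they are only accessing the feed anyway.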
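
If you would rather keep a scraper away from the feed alone, one possible sketch is to wrap the same directives in a Files block. The file name feed.xml and the IP address are placeholders, and this assumes your feed is served as a static file with that name rather than generated by a script at some other address.

```apache
# Apply the block only to the feed file (hypothetical name)
<Files "feed.xml">
    Order allow,deny
    Allow from all
    # Placeholder address for the scraper
    Deny from 192.0.2.15
</Files>
```

With this in place, the blocked address can still load the rest of the directory, which is the safer choice if you suspect the IP may be shared with legitimate visitors.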

Also, if you want to redirect scrapers to a fake feed, you can use the method discussed on Hung Truong, which often generates very humorous results.
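
The general idea behind a fake feed can be sketched with a rewrite rule keyed to the scraper's address. This is only an illustration and not necessarily the method described on Hung Truong. The IP address and the file names feed.xml and decoy-feed.xml are assumptions, and it presumes both files sit in the same directory as the .htaccess file.

```apache
RewriteEngine on
# Match only the scraper's address (placeholder IP, dots escaped)
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.15$
# Quietly serve the decoy in place of the real feed
RewriteRule ^feed\.xml$ decoy-feed.xml [L]
```

Because this is an internal rewrite rather than a redirect, the scraper sees nothing unusual and simply receives the decoy content at the normal feed address.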

Finally, once again, if you are uneasy with editing the files yourself, HTML Basix offers another .htaccess generator, this one to block users. It is also useful to stop RSS scraping.

The bottom line is that, once you obtain the IP address of the scraper, it is trivial to block them using .htaccess. All you need is a little bit of understanding and some freely-available code.

Conclusions

Though .htaccess editing can seem very intimidating to a novice, it is actually very easy to do. With the proper tools and a few fundamentals, anyone can manipulate their .htaccess files and use them to their advantage.

Though these manipulations won’t do anything to stop human plagiarism, they can stop some of the more common types of plagiarism before they happen, all without impacting legitimate users. It makes sense, if possible, to use these methods to your advantage.

However, it is important to note that you are unlikely to find a free host that allows manipulation of .htaccess files. This is, predominantly, the feature of paid hosting companies. Also, it will not work if your images and RSS feeds are on another server, such as Flickr or FeedBurner.

But if you’ve paid for your hosting, it makes sense to use the tools that come with that kind of an upgrade. One of those is the ability to protect your content at the server level.

It is a great power and one that is sorely underused on the Web.

42 comments
Linkbuildr

Thanks for the tip! I am getting out ranked by a scraper site badly recently in Google and hopefully this puts a stop to it.

TechGyo

Great resource. I'm linking this article to one of my article related to content theft.

Jaydee Escobedo

I wanted to discuss a problem that I am having with my website. I currently have a website hosted by GoDaddy and there are 2 domains that are pointing to my IP address. These 2 domains currently are showing my content but they do not belong to me. They are an exact replica of my website. My website is www.indesignsstudio.com and the 2 domains that are pointing to my IP address are www.chnyes.com and www.antiquecenter.us. How can I stop this from happening. I have contacted GoDaddy and they have adviced me that due to the nature of DNS that they have no control where domains point to. There advice is that I should create a script to stop this from happening or I can re-route the domains to another destination by setting them up as aliases on my hosting account. How do I do this since both domains do not belong to me. Through research I have found that both domains are currently on the same servers as mine. Any ideas would be helpful.

JB

Drmike,

Very welcome! Good luck with making them stop!

AskApache,

That's high praise coming from your background. Thank you very much!

AskApache

Hey very impressive write up thanks!

drmike

Thanks for the links to the plugins. Getting tired of these idiots.

MacBros

I've had to stop using the ban IP on my site because it was blocking legit users because of the rotating IP's in some countries and ISP's

Gabriel

Thank you! This is the only article I found that explained .htaccess configuration with regard to content theft clearly. This is a great article for beginners like me.

JB

Jeremy,
Good point. Definitely a point to consider for people who put a lot of images in their feed...

Jeremy Steele

The flaw with that image hotlinking htaccess entry is that if you run a blog, users who use Google reader or other RSS readers won't be able to see images.

Using a script to track bandwidth theft is a much better solution, because then you can tell who is actually stealing your bandwidth without interrupting users who are innocently using RSS readers to view the pictures.
