Most of the debate about content reuse has focused on copying. Whether it is Creative Commons licenses allowing some copying, spammers scraping content en masse, plagiarists copying works without attribution, or aggregators reusing content questionably, most of the discussion has centered on traditional copy-and-paste reuse.
This technique doesn’t require the copycat site to host any of the content itself. Instead, it pulls the material directly from your site and makes it appear, to search engines and users alike, that it is the one providing it.
It’s become the subject of a great deal of debate on the Web and, as new services come online, seems likely to be an increasing problem.
So how does it work and what can you do about it? The answers are unfortunately not very clear.
How Proxying Works
Most of the time, when content is copied on the Web, it is done in a very traditional fashion: it is copied from one location, either by a bot or a human, and pasted to another. This means that two versions of the data exist on the Web, one on your site and one on the second site.
A common analogy: if you had a text file on your computer, copied its contents, created a new file and saved the data to it, you would have two copies of the data on your machine. A search for a phrase from the file would then turn up two different files, much as two different sites come up in Google.
Proxying, however, is different. With it, the “copy” site doesn’t physically host the data; rather, it hosts a link to it. It loads the content from your server, which provides the information as if the proxy were any other visitor, and then manipulates the data as it sees fit.
This means that every time a visitor on their site loads a page with your information on it, their site visits yours, extracts the data and then presents it.
To extend the analogy, it is as if you created a second file that, instead of containing a copy of the information, simply linked to it. The original text would appear whenever the file was opened, but the file itself would hold only a pointer to the data, not the data itself. Those opening it would not be able to tell the difference and, depending on how you searched, your computer might not either. The data, however, would not exist within the second file.
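As a rough sketch of the core mechanic: a proxy fetches a page from your server and, before serving it, rewrites the page’s links so they route back through the proxy, which is how it keeps visitors (and sometimes search engines) on its own domain. The function below is a deliberately naive illustration that handles only root-relative href attributes; all names and URLs are hypothetical.

```python
import re
from urllib.parse import quote

def proxy_rewrite(html: str, original_base: str, proxy_base: str) -> str:
    """Rewrite root-relative links in fetched HTML so they route back
    through the proxy instead of the original site (illustration only)."""
    def repl(match):
        # Re-point the link at the proxy, passing the real URL as a parameter.
        real_url = original_base + match.group(1)
        return 'href="%s?url=%s"' % (proxy_base, quote(real_url, safe=""))
    return re.sub(r'href="(/[^"]*)"', repl, html)

page = '<a href="/about">About</a>'
print(proxy_rewrite(page, "https://example.com", "https://proxy.example/view"))
# -> <a href="https://proxy.example/view?url=https%3A%2F%2Fexample.com%2Fabout">About</a>
```

A real proxy does this for every URL in the page (images, stylesheets, forms), which is why it is so hard for a visitor to “escape” back to the original site.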
Proxying is something very different from traditional copying and has more in common with framing than scraping or plagiarism. However, it carries with it many of the same potential risks as traditional copying, including visitor confusion, duplicate content and more.
Why Proxying Is a Worry
Proxying has been around for many years; however, it has most commonly been used for anonymous surfing. Services such as Anonymouse have provided users with a relatively high level of anonymity when browsing the Web.
The idea is pretty simple. Since Anonymouse, or any other proxy service, is loading the Web page for the visitor, the sites they go to only see the information for Anonymouse, not the person loading the site. This makes it harder for a visitor to be tracked.
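From the origin server’s point of view, that is exactly what happens: the request arrives bearing the proxy’s own address, and the only trace of the real visitor, if any, is a header the proxy chooses to send (anonymizing proxies strip it). A small sketch, assuming a WSGI-style request environment; the addresses are illustrative.

```python
def apparent_visitor(environ: dict) -> str:
    """Best guess at the real visitor's IP from a WSGI-style environ.
    A courteous proxy may reveal the client in X-Forwarded-For;
    an anonymizing proxy will not, so all you see is the proxy itself."""
    forwarded = environ.get("HTTP_X_FORWARDED_FOR", "")
    if forwarded:
        # Header may hold a chain of addresses; the first is the client.
        return forwarded.split(",")[0].strip()
    return environ.get("REMOTE_ADDR", "unknown")

# Anonymizing proxy: only the proxy's own address is visible.
print(apparent_visitor({"REMOTE_ADDR": "203.0.113.10"}))
# Courteous proxy: the real visitor shows up in the forwarded header.
print(apparent_visitor({"REMOTE_ADDR": "203.0.113.10",
                        "HTTP_X_FORWARDED_FOR": "198.51.100.7"}))
```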
However, the problem is not with services such as Anonymouse, which use proxying in a responsible manner, but those who are reckless about it.
Proxying itself is not evil, but if it is done maliciously or negligently, it can cause Webmasters a great deal of grief.
Consider the following issues:
- Search Engine Troubles: As the Workfriendly case showed, search engines cannot tell the difference between a page hosted on its own server and one that is proxied. They will routinely index a proxied page unless the operators of the proxy take proper precautions. This can lead to duplicate content issues or even bump original sites into the supplemental index.
- Bandwidth Concerns: Every pageview on the proxy site causes a page to be downloaded from the original. This might not be a major concern for most sites, but if a proxy site gets a great deal of traffic, it can force Webmasters to spend a great deal of bandwidth feeding the proxy’s visitors.
- Manipulation: Since the proxy site controls what the visitor actually sees, it can do anything it wants with the content before it reaches their eyes. In the case of Workfriendly, the content was manipulated to appear as a Word document; Tynt manipulates it to add its users’ stickers and bubbles. Less ethical sites could add their own advertisements or attribution, or completely change the information presented.
- Copyright Issues: Framing is typically considered a form of copyright infringement, making it likely that proxying could be as well. But with no files hosted on the server, there is no means of filing a DMCA notice. This issue routinely came up for those who tried to stop Workfriendly, as its server held only the home page and the script that performed the proxying.
Again, this is not to say that all proxying, or even the companies mentioned, is evil; only that the technique has the potential for widespread abuse and that it is only a matter of time before we see more malicious proxying.
The question then becomes: what can be done about it? As I see it, there are two different angles of attack to consider.
Preventing Bad Proxying
As a Webmaster, preventing proxying of your content is a difficult matter, especially since most bloggers don’t have the necessary experience or tools.
However, if this issue concerns you greatly, I would recommend the following steps:
- Host Your Own Domain: Having control over your own server allows you to block IP addresses and keep anyone you don’t want, including proxy services, away from your site. If any site bothers you, you can easily stop it at the gate.
- Link To Yourself: Make it clear to visitors what site they should be on. Mention the URL of your site regularly and link to yourself when practical. Not only is this good marketing in many cases, but it also lets anyone viewing through a proxy know your actual location, and it may help with search engines as well. Do be warned, though, that many proxy services will manipulate links to keep their users within their site.
- Contact Hosts/Webmasters: Though you cannot use DMCA notices against proxy services, nothing stops you from writing the administrator and asking to be removed, or from using any opt-out tool they provide. Failing that, you can still contact their host with a generic abuse report; there is a good chance such proxying violates the host’s terms of service.
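The first step above, blocking a proxy’s address, is normally done in the server configuration (Apache’s deny rules, for example), but it can also be done at the application layer. Here is a minimal sketch as WSGI middleware; the blocked addresses are hypothetical placeholders, not real proxy servers.

```python
# Hypothetical addresses of proxy servers you want to keep out.
BLOCKED_PROXIES = {"203.0.113.10", "198.51.100.7"}

def block_proxies(app):
    """Wrap a WSGI app so requests from blocked IPs receive a 403."""
    def middleware(environ, start_response):
        if environ.get("REMOTE_ADDR") in BLOCKED_PROXIES:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Access denied."]
        # Anyone else passes through to the real application.
        return app(environ, start_response)
    return middleware
```

Because the proxy fetches your pages from a fixed set of servers, blocking those addresses cuts it off entirely, without affecting ordinary visitors.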
But as effective as these steps can be, I feel strongly that the best approach is not to block proxy services, but to encourage ethical proxying. After all, such services can provide a valuable resource to users and, if done well, can actually help Webmasters.
To that end, I would recommend the following guidelines:
- No Search Engine Indexing: Pages produced through proxying should not be indexed by the search engines. Though the use of robots.txt doesn’t seem to have kept Tynt pages out of Google, proxy services should do everything they can to prevent such indexing.
- Clear Exit: The proxy should A) make it clear to the user that it is a proxy site and that the page is not the original, and B) provide a clear means for the visitor to go to the original site.
- No Advertising: Sites that display other pages via a proxy should not display advertising next to other people’s content. Though it is a line that is admittedly vague legally, it is one that a lot of people have an issue with personally and professionally. Any proxy service that wishes to avoid the wrath of Webmasters should pay attention.
- Limited Manipulation: The proxy should change as little about the site as possible and should never change the meaning or bypass access controls. Areas such as filtering advertising are a gray area, but effort should be taken to ensure that the visitor viewing through the proxy gets the same experience as one visiting the site directly.
- Clear Opt Out: If a site does not wish to have its pages displayed by the proxy, it should be able to opt out easily, with no questions asked. Such opting out should not be predicated on robots.txt or even meta tags, given the limitations many free hosts place on Webmasters.
The bottom line is that the proxy should behave as transparently as possible, something most proxies do but many in the future will not.
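The first guideline, at least, is straightforward for a proxy operator to honor: send a noindex directive with every proxied page, both as an HTTP header and as a meta tag injected into the page itself. A sketch of the injection half, with a helper name of my own invention:

```python
# The standard robots meta tag telling search engines not to index a page.
NOINDEX_META = '<meta name="robots" content="noindex, nofollow">'

def add_noindex(html: str) -> str:
    """Insert a robots noindex tag right after <head>, if one exists;
    otherwise prepend it, so the directive is always present."""
    if "<head>" in html:
        return html.replace("<head>", "<head>" + NOINDEX_META, 1)
    return NOINDEX_META + html

# The belt-and-suspenders companion: send the same directive as a header.
NOINDEX_HEADERS = [("X-Robots-Tag", "noindex, nofollow")]
```

A proxy that does this for every page it serves removes the duplicate-content threat almost entirely, whatever else it does with the content.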
At this moment, proxying is not a major issue. The bandwidth and other resources required to maintain a proxy on a large scale outweigh the benefits to spammers and others that might want to maliciously use the technique.
However, as copyright issues grow and bandwidth gets cheaper, it is a near certainty that some people will start to look at this as an alternative to traditional scraping.
The hope is that both the law and the technology will have caught up to the technique by then, making it less effective and easier to stop.
That is not likely, but there does seem to be some effort on that front.