workFRIENDLY: An Accidental Scraper

Jonathan Bailey, November 9, 2007


On the surface, workFRIENDLY is something of a novelty site.

The idea is pretty simple: you punch in a URL that you want to visit and workFRIENDLY pulls up the site in a format that resembles a Microsoft Word document (see Blog Herald on workFRIENDLY). The thinking is that, if you use the site to surf the Web while at work, it will look less suspicious than a regular browser should your boss walk by.

However, the simple and somewhat tongue-in-cheek nature of the site belies a potential threat to Webmasters. A Google search of the domain reveals hundreds of thousands of indexed pages, only one of which, the home page, is original in its content. The rest are cached versions of the pages entered into the system.

The result is that workFRIENDLY, probably by accident, has become one of the most prolific scraping sites I’ve seen and certainly one of the best at getting its results listed in Google.

What is Going On

Fundamentally, workFRIENDLY is little different from other proxy services on the Web, including Anonymouse and even Google cache. What separates workFRIENDLY from these services is that it modifies the look of the page so that it appears to be in the format of a Word document.

This modification of the text raises copyright questions of its own, especially in light of the Google cache ruling, which hinged in part on Google’s lack of modification of the original site, and section 512(b), which protects transitory services from infringement claims for hosting infringing material so long as the data is “transmitted through the system or network without modification of its content.”

However, for most Webmasters, it is likely to be a technical issue that raises the most concern.

The problem is that most sites like workFRIENDLY use a robots.txt file to block search engines from indexing the pages viewers pull up. For example, Anonymouse has a robots.txt file that blocks access to their cgi-bin directory, which is where all anonymous browsing takes place. Likewise, Google has a robots.txt that blocks the search directory, which is where it displays its cached copies.

However, workFRIENDLY does not have such a robots.txt file. In fact, as of this writing, the site doesn’t have one at all. This has caused search engines to index virtually everything they can get their hands on, including over a quarter of a million pages according to Google’s admittedly flawed estimate and over one million in Yahoo.

Though not everything workFRIENDLY has ever visited has been indexed, it appears that a good percentage of it has and that the site is not doing anything to stop it. Worse still, some of these pages have started turning up in organic search results, especially for more obscure keywords, and some Webmasters are getting upset about it.

What To Do About It

The person who brought this to my attention, who wishes to remain anonymous, said she attempted to contact the host of the site, WebHost4Life, but was told that the data used to create the cached copies was stored elsewhere.

However, after further investigation, it appears that workFRIENDLY doesn’t use cached copies at all. Rather, the pages are created dynamically with each visit. I was able to test this by visiting their copy of Plagiarism Today and refreshing the page every few minutes to see if the time of the workFRIENDLY page had changed, even though my site had not.

If the site had used a cache, the timestamp would have stayed the same. As you can see in the images below, it did not.

First capture:

Second capture:
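The comparison behind those two captures can be written down as a short check. This is only a minimal sketch in Python: the time format matched by `TIME_PATTERN` and the sample strings are assumptions made for illustration, since the exact markup of the workFRIENDLY page is not reproduced here.

```python
import re
from typing import Optional

# Assumed format: a clock time such as "10:02 AM" or "14:05:33" somewhere in the page text.
TIME_PATTERN = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?(?:\s?(?:AM|PM))?", re.IGNORECASE)


def extract_time(page_text: str) -> Optional[str]:
    """Return the first clock time found in the page text, or None if there is none."""
    match = TIME_PATTERN.search(page_text)
    return match.group(0) if match else None


def looks_cached(first_capture: str, second_capture: str) -> Optional[bool]:
    """Compare the times shown in two captures taken a few minutes apart.

    True  -> the time did not change, which is what a stored copy would show.
    False -> the time changed, so the page was rebuilt on each visit.
    None  -> no time could be found in one of the captures.
    """
    first = extract_time(first_capture)
    second = extract_time(second_capture)
    if first is None or second is None:
        return None
    return first == second


if __name__ == "__main__":
    # Hypothetical captures, not the real page text.
    print(looks_cached("Document1 - saved 10:02 AM", "Document1 - saved 10:07 AM"))
```

A changing time only shows that the wrapper page is generated on each visit. It is a quick test, not proof of how the site stores or retrieves the underlying content.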

This creates a strange problem in that there is no content that the host of the site can take down. The site itself is little more than the homepage and the needed scripts to pull down the data for display.

With no page to take down, the host is in a difficult situation. Though they can disable the whole domain, there is no easy way for them to remove just the infringing work. Worse yet, there is no way to simply block workFRIENDLY using meta tags or robots.txt as there is no documentation for their spider.

Instead, the focus shifts to blocking workFRIENDLY from accessing your site, which I was able to achieve using the plugin WP-Ban and the IP address of the site.

You can also edit your .htaccess file to achieve the same effect.
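As a rough illustration of the .htaccess route, the Python helper below prints a block of rules that denies specific IP addresses. It is a sketch under assumptions: `203.0.113.10` is a documentation placeholder and not the real address of workFRIENDLY, the function name is made up for this example, and the output uses the `Require` syntax of Apache 2.4. Older Apache 2.2 servers use `Order` and `Deny from` lines instead.

```python
import ipaddress
from typing import Iterable


def htaccess_block_rules(addresses: Iterable[str]) -> str:
    """Build Apache 2.4 .htaccess rules that allow everyone except the given IPs."""
    lines = ["<RequireAll>", "    Require all granted"]
    for raw in addresses:
        # ip_address() raises ValueError on malformed input, so a typo
        # never ends up in the generated file.
        address = ipaddress.ip_address(raw.strip())
        lines.append(f"    Require not ip {address}")
    lines.append("</RequireAll>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder address; substitute the IP you see in your own server logs.
    print(htaccess_block_rules(["203.0.113.10"]))
```

Blocking by IP only lasts as long as the site keeps the same address, so it is worth rechecking your logs from time to time.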

If that is not practical and you find that your search engine results are threatened or usurped by the duplicate content, you can always file a DMCA notice with Google to get the pages removed. You can also report spam results if you see workFRIENDLY URLs in the results of other searches you perform.

Though not an ideal solution, it is at least a workable one and can serve as a stopgap until a more permanent answer, or at least some decisive action from Google, takes place.

Conclusions

I want to be clear that I do not think workFRIENDLY is doing anything malicious and, in truth, it may not even be illegal. I believe that they set up this site never expecting these results to be indexed. There is currently no advertising on the site, save the home page, and the site has hardly achieved what one would call great success with the search engines.

Unfortunately, a letter to workFRIENDLY went unanswered for quite some time so I do not have any word from them on this matter.

If the site would simply create a two-line robots.txt file that blocked indexing of the “browse” directory, the whole matter would be resolved in a very short period of time. However, as of this writing, they have not done that.
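To show what such a file might look like and how a well-behaved crawler would read it, here is a small Python sketch using the standard library's robots.txt parser. The `/browse/` path is taken from the directory name mentioned above; the `proxy.example` host and the query string are placeholders, not the site's real URLs.

```python
from urllib.robotparser import RobotFileParser

# The two-line file described above: every crawler, keep out of /browse/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /browse/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Placeholder URLs for illustration only.
    print(is_crawlable(ROBOTS_TXT, "http://proxy.example/"))
    print(is_crawlable(ROBOTS_TXT, "http://proxy.example/browse/index.php?url=example.org"))
```

With this file in place the home page stays crawlable while everything under the proxied directory is off limits to crawlers that honor robots.txt. Pages already in the index would still take time to drop out.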

The result is that this site has become not just a prolific scraper, but one of the most difficult ones to deal with.

But what is worse about this case is the black eye that this gives Google when it comes to dealing with duplicate content. The fact that a site such as workFRIENDLY can, without any real effort, push hundreds of thousands of duplicate pages into the search results is very worrisome.

Hopefully Google can get around to fixing this issue soon. We have too much to worry about with the malicious spammers to spend much time on those who simply make a mistake.
