workFRIENDLY: An Accidental Scraper

By Jonathan Bailey • Nov 9th, 2007 • Category: Articles, DMCA, Legal Issues, News, Prevention

On the surface, workFRIENDLY is something of a novelty site.

The idea is pretty simple: you punch in a URL that you want to visit and workFRIENDLY pulls up the site in a format that resembles a Microsoft Word document (see Blog Herald on workFRIENDLY). If you use the site to surf the Web while at work, the thinking goes, it will look less suspicious than having a regular browser open should your boss walk by.

However, the simple and somewhat tongue-in-cheek nature of the site belies a potential threat to Webmasters. A Google search of the domain reveals hundreds of thousands of indexed pages, only one of which, the home page, is original in its content. The rest are cached versions of the pages entered into the system.

The result is that workFRIENDLY, probably by accident, has become one of the most prolific scraping sites I’ve seen and certainly one of the best at getting its results listed in Google.

What is Going On

Fundamentally, workFRIENDLY is little different from other proxy services on the Web, including Anonymouse and even Google cache. What separates workFRIENDLY from these services is that it modifies the look of the page so that it appears to be in the format of a Word document.

This modification of the text raises copyright questions of its own, especially in light of the Google cache ruling, which hinged in part on Google’s lack of modification of the original site, and section 512(b), which protects transitory services from infringement claims for hosting infringing material so long as the data is “transmitted through the system or network without modification of its content.”

However, for most Webmasters, it is likely to be a technical issue that raises the most concern.

Most sites like workFRIENDLY use a robots.txt file to block search engines from indexing the pages their visitors pull up. For example, Anonymouse has a robots.txt file that blocks access to its cgi-bin directory, which is where all anonymous browsing takes place. Likewise, Google has a robots.txt that blocks the search directory, which is where it displays its cached copies.

However, workFRIENDLY does not have such a robots.txt file. In fact, as of this writing, the site doesn’t have one at all. This has caused search engines to index virtually everything they can get their hands on, including over a quarter of a million pages according to Google’s admittedly flawed estimate and over one million in Yahoo.

Though not everything workFRIENDLY has ever visited has been indexed, it appears that a good percentage of it has and that the site is not doing anything to stop it. Worse still, some of these pages have started turning up in organic search results, especially for more obscure keywords, and some Webmasters are getting upset about it.

What To Do About It

The person who brought this to my attention, who wishes to remain anonymous, said she attempted to contact the host of the site, WebHost4Life, but was told that the data used to create the cached copies was stored elsewhere.

However, after further investigation, it appears that workFRIENDLY doesn’t use cached copies at all. Rather, the pages are created dynamically with each visit. I was able to test this by visiting their copy of Plagiarism Today and refreshing the page every few minutes to see if the time of the workFRIENDLY page had changed, even though my site had not.

If the site had used a cache, the timestamp would have stayed the same. As you can see in the images below, it did not.

First capture:

Second capture:
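
For readers who want to repeat the test without eyeballing screenshots, here is a rough Python sketch of the same comparison. It assumes you have saved the text of two captures taken a few minutes apart while the source page was unchanged. The timestamp pattern and the function names are my own assumptions, not anything workFRIENDLY documents, so adjust the pattern to whatever clock the rendered page actually shows.

```python
import re

# Hypothetical timestamp format (e.g. "10:42 AM" or "14:05:30"); adjust the
# pattern to match the clock shown on the page you are testing.
TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\s?(?:AM|PM)?", re.IGNORECASE)


def extract_timestamp(page_text):
    """Return the first clock-like string found in the page text, or None."""
    match = TIMESTAMP.search(page_text)
    return match.group(0).strip() if match else None


def looks_dynamic(first_capture, second_capture):
    """Return True if two captures of an unchanged source page show
    different timestamps, which suggests the page is rebuilt per visit."""
    first = extract_timestamp(first_capture)
    second = extract_timestamp(second_capture)
    if first is None or second is None:
        raise ValueError("no timestamp found in one of the captures")
    return first != second
```

A cached copy would carry the same timestamp in both captures, so `looks_dynamic` would return False. Differing timestamps point to a page generated on each request.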

This creates a strange problem in that there is no content that the host of the site can take down. The site itself is little more than the homepage and the needed scripts to pull down the data for display.

With no page to take down, the host is in a difficult situation. Though they can disable the whole domain, there is no easy way for them to remove just the infringing work. Worse yet, there is no way to simply block workFRIENDLY using meta tags or robots.txt as there is no documentation for their spider.

Instead, the focus shifts to blocking workFRIENDLY from accessing your site, which I was able to achieve using the plugin WP-Ban and the IP address of the site.

You can also edit your .htaccess file to achieve the same effect.
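
If your site runs on a Python application rather than WordPress or plain Apache, the same idea can be sketched at the application level. This is only an illustration of IP-based blocking, not how WP-Ban or .htaccess work internally. The address below is a placeholder from the reserved documentation range, since the real address has to come from your own server logs, and the names `is_blocked` and `block_proxy` are made up for this example.

```python
import ipaddress

# Placeholder network from the documentation range (RFC 5737). Replace it
# with the address your own logs show for the proxy you want to block.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.25/32")]


def is_blocked(remote_addr, blocked=BLOCKED_NETWORKS):
    """Return True if remote_addr falls inside any blocked network."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # unparseable address: do not block on bad input
    return any(address in network for network in blocked)


def block_proxy(app):
    """Wrap a WSGI app so that blocked addresses receive a 403 response."""
    def wrapped(environ, start_response):
        if is_blocked(environ.get("REMOTE_ADDR", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return wrapped
```

Keep in mind that a block like this only holds as long as the proxy keeps fetching pages from the same address, so it is worth checking your logs from time to time.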

If that is not practical and you find that your search engine results are threatened or usurped by the duplicate content, you can always file a DMCA notice with Google to get the pages removed. You can also report spam results if you see workFRIENDLY URLs in the results of other searches you perform.

Though not an ideal solution, it is at least a workable one and can serve as a stopgap until a more permanent answer, or at least some decisive action from Google, takes place.

Conclusions

I want to be clear that I do not think workFRIENDLY is doing anything malicious and, in truth, it may not even be illegal. I believe that they set up this site never expecting these results to be indexed. There is currently no advertising on the site, save the home page, and the site has hardly achieved what one would call great success with the search engines.

Unfortunately, a letter to workFRIENDLY went unanswered for quite some time so I do not have any word from them on this matter.

If the site would simply create a two-line robots.txt file that blocked indexing of the “browse” directory, the whole matter would be resolved in a very short period of time. However, as of this writing, they have not done that.
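
To make that concrete, here is a small sketch that uses Python’s standard `urllib.robotparser` to check what a two-line file would do. The file contents are my assumption of what such a robots.txt might look like, and the helper name `crawler_may_fetch` is made up for the example. The “browse” path comes from the workFRIENDLY URLs themselves.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents of the two-line file; workFRIENDLY has published no such
# file, so this is only what one could plausibly look like.
ROBOTS_TXT = """\
User-agent: *
Disallow: /browse/
"""


def crawler_may_fetch(url, robots_txt=ROBOTS_TXT, user_agent="*"):
    """Return True if a robots.txt-compliant crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


proxied = "http://www.workfriendly.net/browse/Office2003Blue/www.plagiarismtoday.com"
home = "http://www.workfriendly.net/"

print(crawler_may_fetch(proxied))  # proxied copy: disallowed by the rule
print(crawler_may_fetch(home))     # home page: still crawlable
```

Under these rules, compliant crawlers would skip everything beneath the “browse” directory while the home page, the only original content on the site, would remain indexable.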

The result is that this site has become not just a prolific scraper, but one of the most difficult ones to deal with.

But what is worse about this case is the black eye that this gives Google when it comes to dealing with duplicate content. The fact that a site such as workFRIENDLY can, without any real effort, push hundreds of thousands of duplicate pages into the search results is very worrisome.

Hopefully Google can get around to fixing this issue soon. We have too much to worry about with the malicious spammers to spend much time on those who simply make a mistake.

Jonathan Bailey is The Webmaster and author of Plagiarism Today, which he founded in 2005 as a way to help Webmasters going through content theft problems get accurate information and stay up to date on the rapidly-changing field. He is also a consultant to Webmasters and companies to help them devise practical content protection strategies and develop good copyright policies.

32 Responses to “workFRIENDLY: An Accidental Scraper”

  1. Wow! I do hope that both workFRIENDLY and Google read this post and do something fast. Knowing Google’s record of the recent past, that is expecting too much perhaps, but the simple solution that you have suggested to the former should enable them to do something quickly.

  3. JB says:

    RS: I doubt workFRIENDLY will do anything given that they didn’t even respond to this article for a few weeks and it doesn’t seem as if Google has any interest in this either, even though it is their database being flooded with duplicate data.

    I don’t see much hope for an easy resolution here.

  5. Recliners says:

    On the face of it, workfriendly sounds like something that simplifies the surfing process for you, but it seems to have far more ominous connotations than a novelty site.

  7. A. Marques says:

    Hi,

    But if workFriendly dynamically generates each page instead of using a cache, is there anything even to index? I’m not quite getting it.

    And talking about scrapers, there is a multitude of sites that are scraping just a few lines of my content (probably enough to get some keywords) with a nofollowed link to the post. Are they breaking any copyright law with this? Have you suffered the same?

  11. JB says:

    Recliners: Agreed. It is an interesting idea, just a flawed execution.

    A. Marques: The way it works is like this. Though workFRIENDLY creates everything dynamically, the URLs are static.

    For example, if you visit this page: http://www.workfriendly.net/browse/Office2003Blue/www.plagiarismtoday.com

    You’ll always see their version of this site. If someone links to it, the search engine will pick up that link.

    Every time someone visits that link, be it a human or a search engine, workFRIENDLY visits my site, pulls down the content and formats it into their page. So, even though no actual page exists on their servers, to the search engines, it’s real.

    In that regard, it works a lot like WordPress or any other dynamic CMS. The pages don’t actually exist, but are created dynamically by combining information from a database with a template. If you look at the server, no directory or files exist; the pages are just made by the server on the fly.

    By the by, it appears that the banning technique I talked about is not working any longer. I am going to examine my log files this afternoon and see what is going on.

  12. JB says:

    A. Marques: I failed to answer your second question. What you’re seeing are search engine or watchlist scrapers. They either scrape relevant search engine results or subscribe to blog watchlists and scrape those.

    The copyright situation of these is much more dubious as there is a much stronger fair use argument to be made. I’ve seen lawyers and experts come down on both sides of the issue.

    There is no easy answer to your question but it is very common and becoming much more so as the years move on.

  13. valerie says:

    Wow, just wow… I don’t have anything to add to the conversation, but I’m subscribing ;-)

  14. JB says:

    Valerie: Welcome aboard! Let me know if I can help in any way.

    Also, visited your site, how are you liking BlogRush?

  17. Valerie says:

    Sorry, I missed your reply earlier! Hi again :)

    BlogRush is okay I guess. I am not sure I’ve seen any real benefit from it yet and the widgets are kind of huge and ugly. I’ll give it some time, though, before I give up on it! :)

  19. JB says:

    Valerie: No problem. Your views pretty much mirror what I’ve read elsewhere. I was tempted by it but I can’t find anyone who has benefited from it.

    That being said, one neat trick I have learned is Commentful. It’s a great way for tracking comments on the Web, I’d go insane without it and it’s my secret weapon for keeping track of where I posted to.

    Thanks for the info!

  21. kev@seoibiza says:

    hey. nice article, have been wondering whats the deal with this site for a while now, as it appears people are looking at our site with it a lot and the pages show up in the serp all over the place.

    we also emailed them about it a while ago and heard nothing. also AVG’s latest browser linkscanner now shows the site as a phishing threat.

    it needs sorting out.

  23. @kev@seoibiza -
    Kev, I was unaware that it is now being called a phishing threat though I hardly find that surprising. It seems to me that the more I learn about this site, the more suspicious I get. Thank you for the update!

  24. Accidental or not, these Office2003blue pages show up in my webmaster tools pages as 404 not founds as if they were missing pages on my site, which can only be a problem in terms of SEO and keeping Google happy. Someone ought to launch a campaign to get this workfriendly system banned…the sooner, the better.

    db

  26. @David Bradley -
    You might be interested in my updated post on WorkFriendly from this year. It details the 404 error issue, what causes it and how to block them from your site.

    http://www.plagiarismtoday.com/2008/04/08/workf...

    Feel free to email me if you have continued problems or if I can help in any way!

Trackbacks/Pingbacks

  1. [...] system. plagiarismtoday.com has a great article with more information on the topic that can be read here [...]

  2. [...] Back in November of last year, I wrote an article about Workfriendly, calling it an “accidental scraper” and accusing the site of allowing search engines to index pages containing scraped [...]

  3. [...] a site previously reported on Plagiarism Today back in November 2007 and again in April of this year, stopped functioning sometime within the past few days, bringing an [...]

  4. [...] I want to make it clear that I do not think Feedblitz is doing this on purpose. The search results that this content will likely rank for does little to help them with their business. I don’t think Feedblitz is trying to be a spammer any more than Workfriendly was. [...]