Back in March, CloudFlare introduced a new product, ScrapeShield, designed to detect when scrapers have grabbed your content and where else it appears.
Initially, I was very skeptical of the idea. There have been many systems out there designed to detect scraping and other copying and none of them have really worked. The reason is that all of the services have required the scraper to grab a beacon of some type, usually a small image, that is then tracked back to where it is republished.
While this sounds great, the problem is that scrapers are often very selective about what they grab and republish, routinely stripping out images or skipping them altogether. If the trackers aren't scraped or aren't republished, the copying goes undetected. Generally, it's much easier to block scrapers and track text content through the use of statistically improbable phrases or, in other cases, digital fingerprints.
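To illustrate what tracking by "statistically improbable phrases" can look like, here's a minimal sketch. A real system would rank phrases against a large reference corpus; this toy version simply prefers phrases built from the least common words in the text itself, as a rough stand-in for "improbable". The function name and scoring are my own illustration, not any particular vendor's method.

```python
import re
from collections import Counter

def candidate_phrases(text, n=6):
    """Return the text's n-word phrases, roughly rarest-sounding first.

    Scoring here is a crude proxy: a phrase ranks higher if it contains
    words that appear infrequently in the text. A production system would
    compare word frequencies against a large external corpus instead.
    """
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(words)
    phrases = [words[i:i + n] for i in range(len(words) - n + 1)]
    # Sort so phrases containing the rarest words come first.
    scored = sorted(phrases, key=lambda p: min(freq[w] for w in p))
    return [" ".join(p) for p in scored]

sample = ("The quick brown fox jumps over the lazy dog while the "
          "dog dreams of chasing iridescent quixotic dragonflies")
top = candidate_phrases(sample)[0]
```

The top-ranked phrase can then be fed into a search engine (in quotes) to find verbatim copies of the text, no beacon required.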
Still, I wanted to give ScrapeShield a fair chance, so I set up CloudFlare on my site and ended up letting it run for several months (I had only intended a few weeks, but hurricanes, illnesses and other factors pushed back the testing).
So what did I learn after several months of using ScrapeShield? Not a whole lot. In fact, I didn't see a great deal that I couldn't easily have done myself.
How ScrapeShield Works
CloudFlare (previous coverage), basically, is a content delivery network. That means it has a large number of datacenters around the world that serve content to visitors from the closest endpoint, speeding up traffic.
If you install CloudFlare on your site, it sits between you and your visitors, delivering static content, such as your images, from its network and only asking your server for things it doesn't have. This reduces the load on your server and, for your visitors, speeds up your site.
However, it has an additional benefit in the form of protection. Bots, such as scrapers, follow the same path as your human readers, meaning they hit CloudFlare's servers before they ever reach yours. That gives CloudFlare an opportunity to block them before they can access your site. This power is amplified by the fact that CloudFlare provides its services to so many domains, giving it a great deal of information about which visitors are bots, scrapers, spammers, etc. This is something CloudFlare has always aimed to do.
ScrapeShield, basically, is an extension of that. In addition to attempting to block scrapers, ScrapeShield inserts tracking "beacons" into your content to learn where it appears when it is copied. It also includes tools to make it easy to block Pinterest, obfuscate your email address and stop image hotlinking.
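CloudFlare hasn't published ScrapeShield's internals, but the general beacon technique works something like the sketch below: a unique, invisible tracking image is stitched into each page served, and if the page is republished verbatim, browsers loading the copy request the image, revealing the copy's location via the Referer header. The endpoint URL and function names here are hypothetical, for illustration only.

```python
import secrets

# Hypothetical beacon endpoint -- not a real CloudFlare URL.
TRACKER_HOST = "https://beacons.example.com"

def insert_beacon(html: str, page_id: str) -> str:
    """Insert a unique, invisible 1x1 tracking image before </body>.

    If a scraper republishes the page verbatim, readers' browsers will
    fetch this image from the tracking host, and the Referer header on
    that request tells you where the content reappeared.
    """
    token = secrets.token_hex(8)  # unique token per page serve
    beacon = (f'<img src="{TRACKER_HOST}/px/{page_id}/{token}.gif" '
              f'width="1" height="1" alt="" style="display:none">')
    return html.replace("</body>", beacon + "</body>", 1)

page = "<html><body><p>Original article text.</p></body></html>"
tracked = insert_beacon(page, "post-1234")
```

The catch, as the rest of this article explains, is that the whole scheme depends on the scraper keeping the image and the reader's browser fetching it.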
You also have the option of joining CloudFlare's community effort, Maze, which aims to send scrapers down a labyrinth of garbage pages that slows down their data collection efforts and wastes their resources.
However, there's no real way for me (without engaging in widespread scraping) to test Maze. Furthermore, since bot blocking was already part of CloudFlare's default product, the only thing left to really test was the content tracking. There, for me at least, the results were less than impressive.
Since ScrapeShield's reports only track the last 25 suspect URLs, I only have data going back to November 23rd. However, looking at those 25 cases, there wasn't a great deal of interest. In total, 18 of those scrapes were actually either Google cache visits or Google translations, six were from Archive.org and the one remaining was from another translation site, Citcat.com.
In the end, ScrapeShield found zero infringements.
This story held true every time I checked in on the results: it found very few URLs, and the ones it did find were always legitimate. This is likely in large part due to CloudFlare successfully blocking many of the worst scrapers, but it's also partly because CloudFlare's tracking technique is far from perfect and doesn't always track the most important content.
In short, ScrapeShield is not likely to be a major revolution in tracking content misuse, nor is it going to replace other tools. In fact, it likely won't find anything at all.
The Problem with ScrapeShield
All in all, there are three major problems with CloudFlare’s ScrapeShield when it comes to its tracking:
- Image-Based Tracking: ScrapeShield works by peppering your content with small images. However, most scrapers already ignore images as text is much more valuable for SEO and hotlinking can get a scraper caught even without ScrapeShield.
- No RSS Protection: The tracking beacons only appear in the HTML pages on your site, not in the RSS feed. Sadly, the RSS feed is the most popular target for content scraping on most blogs, and that leaves a major hole.
- Can’t Catch Most Humans: Humans who visit your site and grab your content will still not be tracked. Though human copying wasn’t the target of ScrapeShield, it’s the biggest threat many sites face.
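The first weakness above is easy to see in practice. Here's a crude sketch of what a typical text-focused scraper does (the code and markup are my own illustration): it extracts only the paragraph text and discards every image, beacons included, before republishing.

```python
import re

def scrape_text_only(html: str) -> str:
    """Crude text scraper: keep paragraph contents, drop all images.

    Scrapers typically want the text (that's what matters for SEO), and
    stripping images also avoids hotlinking. Any image-based beacon is
    discarded along with the rest, so it never phones home.
    """
    html = re.sub(r"<img\b[^>]*>", "", html)  # beacons vanish here
    paragraphs = re.findall(r"<p>(.*?)</p>", html, re.S)
    return "\n\n".join(paragraphs)

original = ('<html><body><p>Article text worth stealing.</p>'
            '<img src="https://beacons.example.com/px/abc.gif" '
            'width="1" height="1" alt="" style="display:none">'
            '</body></html>')
stolen = scrape_text_only(original)  # the beacon is gone
```

The republished copy contains only the text, so the tracking host never receives a request and the copying goes unnoticed.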
What all of this adds up to is simple. ScrapeShield, even at its best, is only going to track a small percentage of the copying that takes place on your site. Given how much of it CloudFlare is capable of outright blocking, you’re most likely going to see a lot of legitimate uses of your content and few infringements.
In fact, the only case I've seen online of someone successfully tracking a scraper involved a site that scraped the headline of a post for a caption.
However, one of the engineers at CloudFlare was able to provide several examples of his blog’s content being tracked successfully by ScrapeShield so, obviously, your mileage may vary.
Still, even if it is very effective for you, ScrapeShield is not a replacement for other methods of tracking content. There are going to be a lot of blind spots in the service no matter what.
None of this is to say you shouldn’t use CloudFlare or ScrapeShield. Both are free and CloudFlare has a good track record of blocking scrapers and other ill-behaving visitors. It’s a great service that almost any site can benefit from with very little risk.
But that being said, ScrapeShield is not a revolution in content tracking as it was hyped up to be. The method it employs has been used by other tools before. In fact, the technique has been used ever since the first time a scraper was caught by accidentally hotlinking a bunch of images from its victim.
Can it work? Yes. Is it going to catch everything? Not nearly. In the game of cat-and-mouse between scrapers and trackers, the technique is several iterations behind. Most of the sites that will get caught in the net are legitimate ones not trying to hide their activities. Even a mediocre scraper will, most likely, avoid this trap.
So, by all means, use ScrapeShield and CloudFlare but understand the limitations of them, especially in terms of content tracking.
Instead, focus on how CloudFlare helps to block bots rather than how it tracks them as that seems to be what it is much better at.