Last year, Copyscape introduced a feature called “Batch Search” that allows users to automatically scan up to 10,000 URLs at once for duplicate content. It is a powerful tool designed to scan an entire site, or even a series of sites, for pages that may have been plagiarized from other sources or are being copied by others.
The idea is simple enough: you put in the URLs that you want to search, purchase the number of credits you need to scan them and then grab a cup of coffee (or a good night’s rest) while Copyscape churns away at the URLs, finding all relevant matches.
I missed a chance to review the tool when it was first launched but recently had an opportunity to spend some serious time with it. As such, I can now offer my belated opinion on the service and what it might be able to do for Webmasters.
On that front, it is easy to see why this would be a popular tool, especially for one-time checks of a site. But does it hold up, and does it actually make the process of checking for copied content faster? The answer seems to be yes, but with some caveats.
How it Works
The process of using Copyscape Batch Search is fairly straightforward. There are three ways you can put URLs into the system: you can have Copyscape scan a single URL and collect the links from it, you can copy and paste a list of URLs with one on each line or you can simply upload a sitemap.
Whichever system you choose, Copyscape will place the URLs into a list, filtering for exact duplicates (though you may have to do some manual filtering of pages with the same text but different URLs) and it will tell you how many credits you need to buy to run the search. Once you’ve paid up, you simply submit the search.
In my experience, the batch search moves at a rate of between 150 and 200 URLs per hour. For a smaller batch, under a hundred, you could probably run the search over a lunch break. Larger ones, over a thousand, may require you to run it overnight or even over several days.
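To put those numbers in perspective, here is a rough sketch of the arithmetic in Python. It assumes only the 150 to 200 URLs per hour that I observed; the function name and defaults are my own for illustration, and your rate may well differ.

```python
def estimate_hours(url_count, slow_rate=150, fast_rate=200):
    """Return (best_case_hours, worst_case_hours) for a batch.

    The default rates are the 150-200 URLs per hour observed in this
    review, not figures published by Copyscape.
    """
    if url_count < 0:
        raise ValueError("url_count cannot be negative")
    best = url_count / fast_rate   # everything runs at the faster rate
    worst = url_count / slow_rate  # everything runs at the slower rate
    return best, worst


for count in (100, 1000, 10000):
    best, worst = estimate_hours(count)
    print(f"{count} URLs: roughly {best:.1f} to {worst:.1f} hours")
```

By this estimate, a full 10,000-URL batch would take somewhere between two and three days to finish.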
Once the results are complete, they are provided in a relatively plain list. You get the URL, the title of the original page, the number of matches found and a color-coded “Risk” indicator that shows how much of the content has been copied (not the number of potential matches).
From there, you have to analyze the results by hand, as you should with any plagiarism scan, and make your determinations about whether Copyscape’s findings were justified.
First off, the process of getting the URLs into Copyscape and performing the search is extremely simple and efficient. I ran a series of batches through, including one very large one, and was able to enter all of them without any issue. Though I couldn’t use the sitemap feature directly for various reasons with the different batches, I was able to extract links successfully using the URL parser and the list feature with ease.
Though the generation of results wasn’t blindingly fast, it was definitely faster than it would have been to perform the searches by hand. In every test, Copyscape beat its estimated time of completion by about 10% and the results were available well before I was expecting them.
The results themselves were also simple and straightforward. Clicking on the URL in the main list takes you to a list of suspected copies, and that list will then take you to the Copyscape match page where you can see the similar text highlighted. You can also view the text on your site, meaning the URL that you checked, for comparison.
Best of all, unlike many Copyscape results pages, they remain in your account after you’re done reviewing them. This way, you don’t have to inspect all the URLs at once and can instead save your work and come back later. All you have to do is access the “Batch Results” section under Copyscape Premium.
In short, it is a straightforward system that gets the job done in very short order. Much like with the rest of Copyscape, it is meant to be as simple as possible while remaining very powerful and efficient to use.
However, there are some hangups that limit the service’s usefulness and hinder your search, especially when doing the human analysis.
The biggest problem with Copyscape Batch Search stems from organizing and parsing the results and it can be a major drain on your time when you are performing very large searches.
First, there is no way to sort or filter the list. The list is a completely static one and the URLs are presented in the order that they were processed. You cannot, for example, put the highest risk matches at the top or only display results with more than X number of matches.
Second, the choice of a color-coded risk indicator seems handy, but to people such as myself who have a difficult time distinguishing colors, it can be a pain. Anyone who is truly colorblind would find this service almost unusable and even I found it difficult to distinguish between some of the colors, as most exist on a spectrum between yellow and red with a lot of shades of orange in between. A numeric/color hybrid system would be much more useful.
However, the risk indicator had an even bigger problem in my tests in that it was, on several occasions, wrong. Though “red” risk URLs always had a high percentage of matching content, I found many light yellow ones that did as well. Though those were rare cases overall, it meant I had to check every single URL by hand to make sure Copyscape had not missed anything glaring.
Going through a very large batch, especially one with a decent number of pages with matching text, is going to be a tedious process, and you may be better served by creating your own spreadsheet to organize the results.
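If you do go the spreadsheet route, a few lines of Python can supply the sorting and filtering that the results page lacks. This is only a sketch: it assumes you have entered the results by hand into a CSV file, and the column names (url, title, matches, risk) and the color labels are my own invention, not a Copyscape export format.

```python
import csv
import io

# Assumed color labels, ranked from most to least severe.
RISK_ORDER = {"red": 3, "orange": 2, "yellow": 1}


def load_results(csv_file):
    """Read rows from a hand-built CSV and convert match counts to integers."""
    rows = []
    for row in csv.DictReader(csv_file):
        row["matches"] = int(row["matches"])
        rows.append(row)
    return rows


def prioritize(rows, min_matches=0):
    """Drop rows below min_matches, then sort by risk and match count, highest first."""
    kept = [r for r in rows if r["matches"] >= min_matches]
    return sorted(
        kept,
        key=lambda r: (RISK_ORDER.get(r["risk"].lower(), 0), r["matches"]),
        reverse=True,
    )


# A small in-memory sample standing in for a real file on disk.
sample = io.StringIO(
    "url,title,matches,risk\n"
    "http://example.com/a,Page A,2,yellow\n"
    "http://example.com/b,Page B,9,red\n"
    "http://example.com/c,Page C,5,orange\n"
    "http://example.com/d,Page D,0,yellow\n"
)
for row in prioritize(load_results(sample), min_matches=1):
    print(row["risk"], row["matches"], row["url"])
```

Sorting this way only changes the order in which you review the pages. Given that the risk indicator was sometimes wrong in my tests, the low-risk rows still deserve a look.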
But despite these shortcomings, there are many situations where Copyscape Batch Search may be useful and is still well worth considering.
The ideal use for this product is a one-off check of a large number of URLs. However, situations where that might be appropriate are few and far between.
If you have a need to continuously monitor your site for copying, Copyscape Batch Search will be both far too expensive and far too time-consuming. You would be much better off either with Copyscape’s Copysentry service or a service like Attributor. Both will provide better case management and will save you money over the long-term.
However, if you need to do a one-time audit of a site, either to see how much content reuse is taking place or if a body of work contains plagiarized material, it can definitely help. Still, I don’t think Copyscape Batch Search is very practical for large batches. You would likely be better off breaking any large job into smaller, more manageable sections so you can go through it and organize it better. Without sorting or filtering, a list of 10,000 URLs would become a nightmare.
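Breaking a long URL list into sections is easy to do before you ever paste it into Copyscape. Here is a minimal sketch in Python; the batch size of 500 is an arbitrary choice of mine, not a Copyscape recommendation, and the example.com URLs are placeholders.

```python
def split_into_batches(urls, batch_size=500):
    """Split a list of URLs into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


# Placeholder URLs standing in for a real site's pages.
urls = [f"http://example.com/page-{n}" for n in range(1, 1201)]
batches = split_into_batches(urls, batch_size=500)
for number, batch in enumerate(batches, start=1):
    print(f"Batch {number}: {len(batch)} URLs")
```

Each batch can then be pasted into the list feature, one URL per line, and reviewed on its own.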
Still, if you have a use for this service, I would not shy away from it. At five cents per search it is a very solid deal overall and Copyscape has always performed well in plagiarism detection, even taking top honors recently in a head-to-head competition.
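For budgeting purposes, the cost is simple multiplication. The sketch below assumes the five-cent price quoted above and one search per URL; both are assumptions on my part that you should check against Copyscape’s current pricing.

```python
CENTS_PER_SEARCH = 5  # price quoted in this review; may have changed


def batch_cost_dollars(url_count, cents_per_search=CENTS_PER_SEARCH):
    """Return the cost in dollars, assuming one search is charged per URL."""
    return url_count * cents_per_search / 100


for count in (100, 1000, 10000):
    print(f"{count} URLs: ${batch_cost_dollars(count):.2f}")
```

At that rate, a maximum-size batch of 10,000 URLs would run $500.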
It’s a great service, just with a very limited use.
If you have a need for what Copyscape Batch Search does best, I’d use it. I recommend it highly for what it does and I plan to use it again in the future as the need arises. Overall, despite its shortcomings, it is a great deal and a solid service.
Best of all, I think the problems that it faces could be fixed with an update as none of them are related to the core process or service.
All in all though, it is an interesting service, even if it has only a limited use. Though it is unlikely most will ever need it, those who do will certainly be glad it is available.