For my day job, I work as a copyright and plagiarism consultant at my firm, CopyByte.
One of the most important parts of that job is performing plagiarism analyses. These are done for a variety of reasons including providing expert witness testimony for court cases, verifying authenticity of a large project, checking a website before sale or even examining a book before publication.
But when I provide a quote for a plagiarism analysis I usually get one of two responses (something many consultants and freelancers can relate to). Either the work is “ridiculously expensive” or “insanely cheap” depending on the person’s perspective.
However, those in the former camp often think that running a plagiarism analysis is as simple as taking the work and running it through a plagiarism checker. Considering that the cost of an automated plagiarism check ranges from a few pennies to a few hundred dollars, they wonder why my cost is often in the thousands.
The reason is that a plagiarism analysis is a time-intensive process. It isn’t simply a matter of pasting a work into a computer and getting a result. There’s a reason there are wikis that crowdsource plagiarism analyses or that universities can take months to complete a plagiarism examination. There’s a lot of work that goes into checking a work, especially a large work, for plagiarism.
To help others understand what is involved, I’m outlining the typical steps I have to go through when performing a plagiarism analysis on a work, in particular a work over 50,000 words long.
Two Types of Plagiarism Analyses
It’s worth noting that there are two kinds of plagiarism analyses. The first is when you’re comparing two or more known works against each other and the second is when you’re trying to determine the originality of an unknown work by comparing it against as much of the world as you can.
Both kinds of plagiarism analyses have their challenges, but, generally, it’s easier to start with a handful of known works. This means that some reason to be suspicious has already been discovered and my objective is to see if that suspicion is valid. Since this analysis will happen offline, it’s both faster and cheaper to perform, and we can customize the tools we use to get the best results possible from the automated analysis.
But in both kinds of analyses, a plagiarism checker tool is invaluable. Without the technology, it’s unlikely that this job would even be possible on the scale I have to do it. However, all plagiarism checkers have the same limitation: They detect matching words, not actual plagiarism.
This not only leaves a lot of potential plagiarism behind, but can cast doubt on common phrases, quoted passages or other innocent overlaps. A human has to go through and separate what is just overlapping text and what is actually concerning.
To do that, however, is a multi-step process that can require days or even weeks of work.
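To make that limitation concrete, here is a minimal sketch of word matching, assuming a checker that simply looks for shared runs of five words. Real checkers are far more sophisticated, and the function names and sample sentences here are invented for illustration only.

```python
import re

def word_ngrams(text, n=5):
    """Return the set of lowercase n-word sequences found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_ngrams(suspect, source, n=5):
    """Return the n-word sequences that appear in both texts."""
    return word_ngrams(suspect, n) & word_ngrams(source, n)

# Invented sample sentences, not taken from any real case.
source = "The quick brown fox jumps over the lazy dog near the river bank."
copied = "As noted, the quick brown fox jumps over the lazy dog every day."
reworded = "A fast, dark-colored fox leaps above a sleepy hound by the water."

print(len(shared_ngrams(copied, source)))    # 5 shared five-word runs
print(len(shared_ngrams(reworded, source)))  # 0, though the idea is the same
```

The reworded sentence borrows the same idea but shares no five-word run, so a purely word-based check reports nothing. The same check would flag a properly quoted and cited sentence just as readily as a copied one.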
Preparation of Source Materials
The first step in any plagiarism analysis is preparing the materials for the software to check. Plagiarism checkers require documents with machine-readable text (such as Docx or RTF) and work best if that text is cleaned up and formatted correctly.
For some analyses there’s not much prep that needs to be done. However, in other cases this can actually be the bulk of the work, especially when the document has to be scanned and turned into clear text through optical character recognition (OCR). That step can take longer than the actual analysis, especially when parsing multiple books.
Still, for most cases dealing with modern materials, there are readily available clear text versions of the works at issue. This makes it easy and means we only have to do minor touch-ups on the documents to make sure that they’re as ready as possible for analysis.
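As a rough illustration of what those touch-ups can look like, here is a minimal sketch that cleans up extracted or OCR'd text. The specific cleanup rules and the `clean_text` name are my own assumptions for this example, not a fixed recipe, and real documents usually need manual review on top of this.

```python
import re
import unicodedata

# Map curly quotes to their straight equivalents.
QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})

def clean_text(raw):
    """Normalize extracted or OCR'd text before it goes to a checker."""
    # NFKC folds compatibility characters, such as the "fi" ligature, into plain letters.
    text = unicodedata.normalize("NFKC", raw)
    text = text.translate(QUOTES)
    # Rejoin words split across line breaks ("pla-\ngiarism" -> "plagiarism").
    # Caveat: this also joins genuinely hyphenated words that break at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks into spaces but keep blank-line paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

raw = "The \u201cpla-\ngiarism\u201d  check\nwas done.\n\nNext   paragraph."
print(clean_text(raw))
```

Small differences like curly versus straight quotes or a stray line break can hide a match or create a false one, which is why this step comes before any checking.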
Perform the Automated Analysis
After preparing the documents, the next step is to actually perform the automated analysis.
The most difficult part here is selecting which tool to use for the job, as every plagiarism checker has its strengths and weaknesses. However, we usually select the tool before starting the project so, when performing this check, it is usually just a matter of setting up and running the software.
Generally, very little time goes into this part but, with some applications, it may require multiple checks to get the settings on the software right to minimize false results.
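To show why settings matter, here is a minimal sketch in which a single hypothetical setting, the minimum match length in words, changes how much of a work gets flagged. It uses Python's standard `difflib` as a stand-in and does not reflect how any particular commercial checker works. The `percent_flagged` name and the sample sentences are invented for illustration.

```python
from difflib import SequenceMatcher

def percent_flagged(suspect, source, min_words=3):
    """Percent of the suspect's words inside matching runs of at least min_words."""
    a = suspect.lower().split()
    b = source.lower().split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    flagged = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_words
    )
    return 100.0 * flagged / len(a) if a else 0.0

# Invented sample sentences.
suspect = "in the end the committee found that the results were not reproducible"
source = "the committee found that the data were sound in the end"

for setting in (1, 3, 6):
    print(setting, round(percent_flagged(suspect, source, setting), 1))
```

With a minimum of one word, even a lone "were" counts and half of the sentence is flagged. At six words nothing is flagged at all. Neither extreme is useful, which is why it can take a few runs to find a setting that keeps the noise down without hiding real overlap.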
Analyze the Results
Once the work has been run through the software, what we have isn’t a report of all of the plagiarism in the work, but of all the duplicative text that the checker found.
Even in a work completely free of plagiarism, we expect a certain percentage to be marked by a plagiarism checker. A 0% finding is often just as suspicious as a 100% finding.
Because of that, it’s important to go through the work and look at the passages involved. Longer passages obviously get attention first simply because they are the easiest to either dismiss or confirm as plagiarism. However, shorter passages have to be checked too because they can indicate poor paraphrasing or an attempt to mask copying.
When it’s all said and done, the passages that are found are put into one of four categories.
- False Positives: No plagiarism checker is perfect, and they sometimes return false positives: super-short passages with no meaning, or matches that aren’t even examples of common phrasing. These can often be caused by formatting errors in the source documents or overzealous detection.
- Common Phrasing/Quotations: Cliches, common sentences, phrases, titles, quotes and other passages of text can be picked up by plagiarism checkers even though they aren’t an issue. These aren’t false positives. The text does match, but it doesn’t represent any plagiarism or attribution issue.
- Poor Paraphrasing/Rewriting: Poor paraphrasing in a work is usually highlighted by a plagiarism checker as a series of short copied passages. It’s very difficult to edit existing text enough to avoid detection, which is why writing in a cleanroom is so important.
- Verbatim Plagiarism: Finally, we come to the verbatim or near-verbatim plagiarisms. These are obviously the most egregious issues we can find: situations where the text is clearly copied (possibly with minor modifications) and not properly cited.
With all of the matches analyzed, we now have to decide what additional steps may need to be taken. If there are questions about factual information being plagiarized, we may need to discuss with an expert in that field.
Also, depending on the nature of the work, we may need to perform image matching or analyze scientific findings (once again with an outside expert). Any of these steps can take just as long or longer than the text analysis but aren’t needed in many, if not most, cases.
Either way, with the analysis complete, we’re able to start compiling our findings and that brings us to the final part of the process.
Compiling the Report
Once we finish the analysis we have to compile the report. No two reports look exactly alike, as different clients want to know different information.
The reports range from a simple statement that the work does or does not contain plagiarized text (with examples and numbers to provide supporting evidence) to mammoth documents on the issues found and the recommended changes/fixes.
But while there’s no one-size-fits-all report, they all have to include the same information:
- Summary of Findings: An executive summary of what was discovered
- Recap of the Process: A summary of what was done and how it was done
- Detailed Findings: Details of what was found
- Analysis of Findings: Our analysis of the findings
- Final Conclusions: A summary of the entire project
Once again, this report can be as short as two pages or as long as several dozen. It’s all a matter of the size of the project and the needs of the client.
To be clear, this is only a brief overview of what goes into a typical plagiarism analysis. Many are far more complicated and have even more moving parts. It also doesn’t cover the correspondence, document acquisition and other elements that can greatly complicate such a project.
But most of what goes into these projects is time. Some of the larger ones we’ve done have taken weeks, even months, to complete. One project took over a year to finish because it involved dozens of books and tomes of allegedly plagiarized material.
Even a smaller analysis can take several days to finish depending on the complexity and the amount of potentially copied text.
While anyone can run a work through a plagiarism checker and read the results, if you’re asking for an outside analysis, especially one that an expert will be staking their reputation on, be prepared for it to take some time.
With that time does come expense. That’s why this kind of plagiarism analysis isn’t appropriate for all situations. Many people are better off simply performing the checks themselves.
However, when it is necessary, it can be time and money well spent.