The Problem with AI Polishing

Yesterday, Derek Newton at The Cheat Sheet highlighted a study from the University of Maryland that examined the effectiveness of AI detectors.
However, the study isn’t your typical examination of AI detection services (PDF). The study’s authors, Shoumik Saha and Soheil Feizi, set out to investigate a specific issue: AI polishing.
To do this, the study’s authors created a massive database of text. It included works that were entirely written by humans, as well as versions of those works that had undergone increasing levels of “polish” from AI systems.
The results were striking. Although the AI detectors did a great job overall of separating wholly human-written work from wholly AI-generated work, any level of AI polish seemed to trigger the systems.
This could be an issue for many students, as it means that any level of AI assistance, no matter how minor, may cause their work to be flagged as AI-generated.
However, even if it doesn’t turn out to be a significant problem, it is something that all authors, not just students, should be aware of.
What the Study Says
The researchers examined a total of 14,713 works. The database started with 300 human-written works from six different sources, each a different type of content, including blog posts, research paper abstracts and emails.
The works were then polished using five separate large language models (LLMs): three different versions of Llama, GPT-4o and DeepSeek V3. The models were tasked with polishing the works in 11 different ways. Four of the prompts were “degree-based” (extreme-minor, minor, slight-major and major), and the remaining seven were percentage-based, ranging from 1% to 75%.
They then checked those works against 12 different AI detection systems: three commercial services and nine models they ran locally.
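To make the setup concrete, here is a minimal sketch of what such a polish-then-detect pipeline might look like. The prompt wordings, the specific percentage levels and the polish_text and ai_probability helpers are hypothetical placeholders for illustration, not the authors’ actual code or any particular detector’s API.

```python
# Hypothetical sketch of a polish-then-detect pipeline (not the study's code).

# Degree-based and percentage-based polishing levels, mirroring the study's setup.
DEGREE_LEVELS = ["extreme-minor", "minor", "slight-major", "major"]
PERCENT_LEVELS = [1, 5, 10, 20, 35, 50, 75]  # illustrative values within the 1%-75% range

def polish_text(text: str, instruction: str) -> str:
    """Placeholder for a call to an LLM (e.g., GPT-4o or Llama) asking it to polish the text."""
    return text  # a real implementation would return the model's rewrite

def ai_probability(text: str) -> float:
    """Placeholder for a call to an AI detector, returning a score between 0 and 1."""
    return 0.0

def build_dataset(human_texts: list[str]) -> list[dict]:
    """Start with the human originals, then add one polished variant per level."""
    samples = [{"text": t, "polish": "none"} for t in human_texts]
    for text in human_texts:
        for level in DEGREE_LEVELS:
            prompt = f"Apply a {level} polish to the following text."
            samples.append({"text": polish_text(text, prompt), "polish": level})
        for pct in PERCENT_LEVELS:
            prompt = f"Polish roughly {pct}% of the following text."
            samples.append({"text": polish_text(text, prompt), "polish": f"{pct}%"})
    return samples

def flag_rate(samples: list[dict], level: str, threshold: float = 0.5) -> float:
    """Share of samples at a given polish level that a detector flags as AI."""
    group = [s for s in samples if s["polish"] == level]
    flagged = sum(1 for s in group if ai_probability(s["text"]) >= threshold)
    return flagged / len(group) if group else 0.0
```

In the study itself, the polishing was done by the five LLMs and the scoring by the 12 detectors; this sketch only illustrates the shape of the loop, with flag rates compared across polish levels.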
In some ways, the results were very positive. Nearly all of the AI detection systems performed reasonably well at not falsely flagging human-written text, with false positive rates ranging from 0% to 8%. Among the commercial detectors, Pangram performed the best, with no false positives, and GPTZero performed the worst at 7.5%.
However, things got interesting when the researchers added any degree of AI polishing. Regardless of how light the polish was, the rate at which the work was flagged as AI jumped drastically.
For example, “minimal polishing” with GPT-4o led to detection rates between 10 and 75 percent. To make matters worse, there wasn’t much difference between minor polishing efforts and major ones. Once AI was involved in any way, the detection rates skyrocketed.
The paper’s authors treat these flags as false positives. Newton, in his coverage, disagrees.
But whether you consider it a false positive or not, there is a significant problem to consider: AI detectors are often a binary solution to an analog problem.
A Lack of Information
As we discussed back in November, AI usage is not a black-or-white matter. There is a gradient to it, ranging from fully human-written to fully AI-written.
However, I would argue that even the content in the authors’ human-written collection was not wholly human-written. Many, if not most, of those works likely made use of automated grammar and spell-checking tools.
While there is nothing unethical about using such tools, it indicates a degree of computer involvement, even if it’s so normalized that most don’t even think about it.
This represents a significant problem for AI detection tools. Although they have improved dramatically over the past two years, their reporting remains very binary. While some detectors may indicate their confidence in the results, they still flag content as either AI or not AI.
However, as this study highlights, a significant portion of writing is neither purely AI-written nor purely human; it lies in between. Unfortunately, the current crop of AI detectors can’t make that distinction.
Whether this is a detection problem or a user interface issue is unclear. What is clear is that AI detectors are erring on the side of flagging anything with AI markings. This means that a piece lightly edited by AI is treated the same as one that was wholly AI-generated.
Granted, this isn’t an issue if AI is completely barred. But what about situations where AI usage is flagged but doesn’t cross the line?
The Big Worry
To be clear, if a student had done what these researchers did, I wouldn’t have much sympathy for their work being flagged as AI-generated.
They fed human-written works to AI systems and directly prompted the AI to polish them. They deliberately involved an AI, and the flagging is likely reasonable, even if it could benefit from additional context.
The bigger concern I have is with students and authors who use tools like Grammarly, LanguageTool and Microsoft Editor. These are popular grammar and spell-checking tools that also incorporate AI elements and can perform advanced sentence rewriting.
This discussion isn’t new. It’s been a debate for nearly two years now, primarily due to alleged false positives caused by these systems.
As a Grammarly user (though not of its generative features), I ran several of my latest posts through the commercial AI detectors the paper’s authors used. None produced a clear false positive. However, that sample size is far too small and limited to be meaningful.
There is a need for more research in this particular area. While the examination of AI polishing is fascinating, it doesn’t represent how most students actually integrate the technology into their writing process.
Still, there is a great deal to be learned here.
Bottom Line
In the end, the key takeaway from this study really depends on your perspective.
When it comes to AI detectors, it’s clear that they’ve made significant strides in the past two years. While they can’t be relied on without human evaluation, that’s not new or unique to AI detectors. The “Swiss cheese” approach is still the best.
Still, educators need to be aware that detectors will flag ANY AI usage, even minor polishing, as AI. That’s an important consideration when evaluating an AI report, especially if you haven’t explicitly banned all AI usage on the assignment.
For students, it’s another reason to be wary about using AI in any capacity, except where expressly permitted. It’s also crucial to do your work in an application that has versioning, and to consider using tools like Grammarly Authorship Verification or Turnitin Clarity to provide additional evidence of how an assignment was created.
All in all, it’s an interesting study that provides valuable insights into how AI detectors analyze hybrid works. Whether these results are good or bad depends on the specific use case. However, it does point to ways that AI detectors could and should improve their results in the future.
Note: If you haven’t subscribed to The Cheat Sheet by Derek Newton and are interested in academic integrity, I highly recommend it.