Earlier this week, a group of four researchers from the Department of Computer Science at the University of Copenhagen and the University of Electronic Science and Technology of China released a paper on the pre-publication archive arXiv that looks at how much large language models (LLMs) “remember” from copyright-protected material.
To that end, the researchers took a very direct approach. They simply asked various artificial intelligence (AI) systems questions such as “What is the first page of [book title]” or “Please complete the following text…” and kept track of what the AI produced.
For the tests, they focused on a list of 19 popular books, all published after 1930 to ensure they are still protected by copyright. The list included titles such as Harry Potter and the Sorcerer’s Stone and Gone with the Wind.
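The probing approach is simple enough to sketch in a few lines. The snippet below is a hypothetical illustration, not the paper’s actual code: `ask` stands in for any text-in/text-out model call, and the overlap score (built on Python’s standard `difflib`) is an assumption of mine; the researchers’ exact scoring method may well differ.

```python
import difflib

def verbatim_overlap(model_output: str, reference: str) -> float:
    """Score the fraction of a reference passage reproduced verbatim,
    by summing the matching blocks difflib finds between the two strings.
    (Hypothetical scoring; the paper may measure memorization differently.)"""
    matcher = difflib.SequenceMatcher(None, model_output, reference)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(reference), 1)

def probe_model(ask, title: str, reference_opening: str) -> float:
    """Prompt a model (any callable: prompt string in, completion string out)
    with the kind of direct question the researchers used, then score how
    much of the book's known opening text comes back verbatim."""
    prompt = f"What is the first page of {title}?"
    return verbatim_overlap(ask(prompt), reference_opening)
```

In practice, `ask` would wrap a real API call to whichever model is under test; running the same probe across several models and several books is what produces the kind of book-by-book comparison the paper reports.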
The results varied wildly from book to book, with Harry Potter being the most remembered title. However, one trend that emerged was that the larger the LLM is, the more it remembered and the more it reproduced when prompted.
To be clear, the paper has a slew of limitations, most of which the authors acknowledge. First, it has not been peer-reviewed and is in the pre-publication phase.
Second, though it did test GPT-3.5, it did not look at the newer GPT-4, which may produce very different results. Third, the direct approach the researchers took is interesting, but it may not be representative of how most people use AI or of what ordinary outputs look like.
Finally, while the paper does discuss the issue in the context of copyright, it stops short of drawing any legal conclusions. Instead, it says that, “Overall, this paper serves as a first exploration of verbatim memorization of literary works and educational material in large language models.”
In that regard, it works well, especially when one considers the legal climate into which the paper arrives.
A Dangerous Legal Climate
The lawsuits all allege the same thing: that AI systems were unlawfully trained on copyright-protected works and that this amounts to infringement. AI companies, for their part, argue that their use is a fair use and that no infringement took place.
But having a real discussion about how AI makes use of copyrighted work is difficult. That’s because, as we discussed before, AI is a black box. We know what goes into an AI system, and we can track what comes out, but even the creators don’t have a great deal of insight into what happens in between.
What this study does is give us a peek into that black box. Though it doesn’t tell us how much of the original work an LLM retains, only how much it will output in response to those specific prompts, it does make clear that the original work is retained, at least in some form.
While this was largely known beforehand, thanks both to similar tests and the fact that AIs often get caught outright plagiarizing, this is an interesting quantification of the problem and an interesting comparison of different LLMs.
What is perhaps most interesting is that, the larger the LLM, the more it seems to retain (or at least the more it is willing to repeat verbatim). While this could be a comment on the guardrails and protections of certain LLMs, it’s still an issue worth examining.
In the end, the most important thing that this study seems to show is that the copyright-protected works are in the LLM systems. While we knew that the systems were trained on them, it’s clear that they are still retained, likely wholesale, even after training.
That may give new ammunition to the authors and other creators currently taking such systems to court.
The impact and importance of this study will most likely be limited. There’s not a lot of truly new information, and what is new may be more of a comment on the protections and guardrails the systems have than on how they use copyrighted material.
Still, it is a very interesting peek into the black box of AI and a surprisingly simple approach to get an idea of how much the AI remembers and is willing to share verbatim.
For all of the concerns about AI originality and how AI systems use copyright-protected work, the AIs themselves aren’t very bashful about showing how much of the original work they store.
In short, it’s clear that copyright-protected works are in the AI systems and can be accessed verbatim with trivial effort. That should give not just authors, but the developers of AI systems, pause to think about the implications.