GitHub, Copilot and the Copyright Around AI

In late June, GitHub, a popular code development tool owned by Microsoft, announced the launch of a new AI tool, Copilot, that it says can help users create new code.

On the surface, Copilot seems like a very impressive tool. Users coding in GitHub can be presented with blocks of code that address the specific problems they are working on. The AI looks at the code the programmer is writing, determines what that person is trying to do, and then suggests code to meet that goal.

However, it wasn’t long after the announcement of Copilot that it began to draw criticism. The most vocal critics were in the open-source programming community, who were upset that GitHub trained its AI on significant amounts of open-source code, and that the AI can even reproduce that code verbatim in some circumstances, without following the respective open-source licenses.

This has led to a massive debate about the legality of Copilot. Some feel strongly that Copilot is infringing. One person, Jake Williams from Breachquest, even challenged GitHub to use Microsoft-owned code for training.

However, others disagree. Julia Reda, a former Pirate Party MEP, published a blog post arguing that it isn’t. She says that text and data mining is not a copyright infringement and that, since an AI cannot produce a copyright-protected work, its output cannot be an infringement.

Unfortunately, the real answer is that no one actually knows. As was pointed out in this excellent article in The Verge, there is no legal precedent here. The technology is simply too new to know for certain, as no judge or jury has really ruled on it.

That said, both sides are actually ignoring one of the bigger issues with an AI like Copilot: if it does spit out infringing code, it won’t be Copilot or GitHub that’s held responsible. It’ll be the user. The reason is that Copilot is not an AI that works to write all-new code. It’s a programming helper, much like an aggressive grammar checker or autocorrect.

The end user is ultimately responsible for what they write and what they publish.

Still, it’s worth taking a deep dive into Copilot to see whether, and how, it could be infringing.

The Two Big Questions

With Copilot, there are two separate but equally important legal questions that need to be looked at.

  1. Training: Copilot was trained on a wide variety of publicly available code, much of it open-source. Was that an infringement of the rights in that code?
  2. Output: The code that Copilot outputs is often similar and, in some rare cases, identical to the code it was trained on. Is that output an infringement?

The answer to both questions is, unfortunately, it depends.

On the first question, the Google Book Search case is often cited as a victory for such data mining. In October 2015, the Second Circuit Court of Appeals found that Google could legally scan and enter books into a database for the purpose of creating a search engine. According to the court, Google’s use was transformative enough to be a fair use.

However, GitHub isn’t creating a search engine. It’s creating a code-writing tool that may or may not produce original code depending on the circumstance.

Also, as pointed out by this paper by Mark A. Lemley and Bryan Casey, there are some seemingly conflicting decisions. That includes the February 2018 decision, also by the Second Circuit, that TVEyes infringed on Fox News’ copyright by copying all of Fox’s content for the purpose of creating a media clip search and sharing tool.

Even within the same circuit, there are two very different results, and the difference turns on what each company was doing with the content.

This brings us to the second question, which depends entirely on the exact snippet of code involved. Is the code very similar to earlier existing code? Is it identical? Is the code original enough to qualify for copyright protection? Is the snippet of code long enough to qualify for copyright protection? Is it so short that its use is a fair use?

These are questions that must be asked on a case-by-case basis. To make matters worse, the recent Supreme Court decision in the Google/Java case (Google v. Oracle), which seems to weaken protection for code in many ways, makes that entire discussion even more confusing.

However, if Copilot does spit out infringing code, it won’t be Copilot, GitHub or Microsoft paying the price for it. It’ll be the person using it. After all, it is “their” code that turned out to be infringing.

The Copilot Bails

Let’s pretend for a second that Copilot isn’t for programmers and, instead, is a writing aid that spits out paragraphs of text for me to use in my essay, novel, or other work.

To me as a writer, this would simply be a very aggressive writing aid. It functions much like a grammar checker or autocorrect tool, making suggestions for new text that I can either accept or reject. But what happens if one of those paragraphs turns out to be a verbatim or near-verbatim copy of an earlier work?

It’s the author who will ultimately be held responsible. It’s easy to imagine students being accused of plagiarism because their AI writing tool reproduced work nearly wholesale from a third party. It’s also easy to imagine a novelist facing a copyright infringement lawsuit because their AI story editor closely reproduced one of the works it was trained on.

While the risk with Copilot is relatively low because it produces short snippets, that doesn’t mean zero risk. Combine this with the fact that there are so many untested legal waters when it comes to AI and you have a recipe for legal uncertainty.

Between the recent Supreme Court decision, the lack of precedent around AI and the newness of the technology broadly, there are just far too many unknowns here to even hazard a guess on where the law stands.

The legal issues around AI will likely take decades to settle, and those who are producing AI systems or, in some cases, using them are stepping into that uncertainty.

That’s not to say it’s a bad thing or even that it isn’t necessary, but it’s something we have to be honest about.

Bottom Line

While I understand the desire for answers when it comes to AI and copyright, there just aren’t many to be found. There are a slew of unknowns, and AI systems are taking a nearly infinite number of approaches to achieve a nearly infinite number of goals.

It will be a long time before the law catches up to where AI is today and, by then, AI will be something new altogether. This is an area that will always be on the bleeding edge of technology, and that means it will be in front of the bleeding edge of the law.

Though people for and against systems like Copilot can sound certain with their opinions, the truth is that we just don’t know where the law and AI will meet and how.

One thing is for certain: when we do get those answers, they will be answers to some of the most important questions copyright has ever addressed.