Stanford University Students Accused of Plagiarizing AI Model

In the field of plagiarism, AI has been the most important and most discussed topic for several years.

First, there’s the growing issue of students using generative AI to cheat on written assignments. That has raised a great deal of concern among educators.

However, those concerns aren’t limited to the classroom. They have spilled into creative fields, where photographers, artists, and authors worry about AI-generated work being passed off as human creations.

Then there is the nature of AI itself. Since most models are trained on large amounts of unlicensed content, many consider all generative AI to be a form of plagiarism. This issue came to a head during the recent Writers Guild of America strike, during which the union referred to AI systems as “plagiarism machines.”

But what about when an AI model plagiarizes another AI model? What are the ethics and norms there?

The space is new, so many norms and expectations have not been established. That said, a recent scandal involving a Stanford University AI project and a Chinese company may hint at some of the boundaries.

The Stanford University AI Plagiarism Scandal

The story began on May 29. That was when three authors, Siddharth Sharma, Aksh Garg and Mustafa Aljadery, announced the release of their Llama3V model. According to them, the model could be trained for under $500 and would deliver performance comparable to GPT-4V, Gemini Ultra and Claude 3.

Those bold claims caught the attention of the AI community. Soon, others began to examine the work of the three authors, two of whom (Sharma and Garg) are Stanford undergrads. Several people then noticed similarities between Llama3V and the MiniCPM-Llama3-V 2.5 model by the Chinese startup Mianbi Intelligence.

A GitHub user named pzc163 posted a lengthy thread highlighting the similarities. They specifically noted how Llama3V had copied significant amounts of code, formatting and other elements. They also noted that the two models gave many of the same answers, including making the same mistakes.

Aljadery attempted to defend his work, saying that it predated MiniCPM’s public release and that they had only used the tokenizer from another project. However, pzc163 easily disproved those claims and alerted MiniCPM’s team to the discovery.

However, social media and GitHub posts about Llama3V quickly began to disappear. Sharma and Garg posted identical apologies on X (formerly Twitter). They blamed Aljadery for the plagiarism, saying he was responsible for coding the model. They apologized for not checking his work more thoroughly but said they were unaware of the previous work.

They said they had taken the Llama3V model down “in respect to the original work.”

While that brings the case to a close, it leaves many unanswered questions.

AI’s Other Plagiarism Problem

Regarding plagiarism and AI, nearly all the attention has been focused on how AI is trained and what it outputs. This is understandable, as these questions impact nearly every creator on the planet.

However, it’s easy to forget that humans also wrote the code used in AI systems. As such, AI programmers have to worry not just about the input and output of their systems, but also about not unethically or illegally copying the work of other AI creators.

This can be difficult. There are countless AI systems out there, and they are licensed under various terms, ranging from fully open source to wholly closed. It’s often difficult to know what you can legally use.

But then comes the issue of ethics. As this case highlighted, even if a use is technically legal, it can still create ethical issues. Mianbi’s CEO said, “While it’s good to be recognized by international teams, we believe in building a community that’s open, cooperative, and trustworthy. We want our team’s work to be noticed and respected, but not in this manner.”

The MiniCPM model is open source under an Apache-2.0 License. With proper attribution and disclosures, it would have been completely legal to use. The lack of attribution raised not just ethical issues but copyright ones as well.

Though a lawsuit is unlikely, especially given that two of the three authors, Garg and Sharma, are undergraduates, this is still a warning for companies entering this space.

So, while AI companies may be used to battling lawsuits from creators, they can’t afford to disregard their competitors. As this case shows, copied code can lead to major headaches in this space.

Bottom Line

The biggest unanswered question in this story is, “What will Stanford do?”

Though the project has been promoted as having a connection with Stanford, it’s unclear what that connection is.

The only clue came from a statement by the director of the Artificial Intelligence Laboratory, Christopher Manning. In it, he said he was unaware of the work and noted it was “done by a few undergrads.” As such, it seems unlikely that this was done as part of coursework or in conjunction with the lab.

If it were part of their coursework, disciplinary hearings and possible action would likely follow. However, if it wasn’t and the lab wasn’t involved, there may not be much that Stanford can do.

Regardless, due to student privacy rules, we are unlikely to learn what steps the school takes. Ultimately, the story is an unfortunate and unfair black eye for the school’s reputation.

It normally wouldn’t matter much to a school if some undergraduate students plagiarized outside of their schoolwork. However, as we saw in the Kaavya Viswanathan scandal in 2006, these stories attract a great deal of attention, especially when a prestigious school is involved.

Combined with the increased, if often unfair, scrutiny schools face around plagiarism and the heavy focus on AI, Stanford will struggle to distance itself from this.

As for the students, I ultimately agree with Mianbi Intelligence’s Chief Scientist, Liu Zhiyuan. If they can learn from this and correct their mistakes, there’s no reason they can’t have a bright future ahead of them.
