YouTube, AI and the Age of Content Laundering

Earlier this week, an investigation by Proof News and Wired found that subtitles from some 170,000 YouTube videos, drawn from more than 48,000 channels, were used to build an open-source dataset named YouTube Subtitles.

That set has been used by a variety of tech companies, including Apple, Anthropic, Nvidia, and Salesforce, to train AI systems.

The dataset features videos from many of YouTube’s biggest creators, including MrBeast, Marques Brownlee and PewDiePie. It also includes media outlets such as ABC News, NPR and the Wall Street Journal.

The dataset only includes subtitles and does not feature any audio or visual content.

To be clear, the dataset is not particularly large, especially when compared to YouTube’s library. However, it’s part of The Pile, a massive 825 GB content dataset. Until August 2023, the Pile famously included Books3, a dataset of nearly 200,000 books, including many pirated works.

The Books3 dataset is central to various copyright infringement lawsuits against AI companies.

While most YouTubers are not included in this subtitle dataset, it should still serve as a wake-up call. AI companies will use your videos for training, and not just to train video AIs. Thanks to subtitles, your words are at risk, too.

For everyone else, this is a reminder that we are in the age of content laundering. AI companies don’t need to steal your content. They can have others do it for them.

Why Steal Content Yourself?

One of the more interesting conversations to come out of the YouTube Subtitles dataset controversy concerns YouTube’s terms of service. According to an interview with Alphabet and Google CEO Sundar Pichai, if OpenAI had scraped YouTube videos, it would have violated the site’s TOS.

However, none of the tech companies involved did the scraping. They didn’t have to. Apple, Anthropic and others have “clean” hands. Though we don’t know who scraped the data or why, it wasn’t the tech companies.

But that’s not much different from Books3. Shawn Presser, an open-source AI advocate, originally uploaded Books3. Presser’s goal was to create an open-source version of Books1 and Books2, which are both proprietary.

We know very little about Books1 and Books2. However, it is widely speculated that both contain pirated content.

But, whether proprietary or open source, one thing is certain: AI companies don’t need to steal your content. They either already have it or can get it from someone else who stole it for them.

To be clear, AI companies can crawl the web for content themselves. OpenAI has even explained how to block its GPTBot crawler. However, they most likely already have your content. If they don’t, they can get it from somewhere else.
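For site owners who want to opt out, OpenAI’s documented approach uses the standard robots.txt mechanism. A minimal sketch, assuming the publicly documented user-agent names (GPTBot for OpenAI; Google-Extended is Google’s separate AI-training token), might look like this:

```text
# robots.txt — example of opting out of AI training crawlers.
# User-agent names below are the ones publicly documented by the
# respective companies; check their docs for the current list.

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Note that robots.txt is honored voluntarily by compliant crawlers; it does nothing to stop a third-party scraper, which is precisely the laundering problem this article describes.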

Did a Reddit user repost your content? Google has access to it and even paid Reddit for the privilege. The same is true for Stack Overflow, DeviantArt and other communities. Though these communities primarily host original content, no community site can claim to be 100% infringement-free.

But that doesn’t matter to the AI companies. They’re paying for the access, not the copyright.

The Age of Content Laundering

The large AI companies argue that using copyright-protected material to train AI systems is not an infringement. They argue that it is a fair use, even though recent rulings throw that argument into doubt.

However, it will likely be years before the legal questions in this space are fully answered. And even if copyright holders secure a major win, AI development will likely not end. There are too many large silos of licensed content that AI systems can fall back on.

However, copyright isn’t the only protection your work typically enjoys. An excellent example is the aforementioned YouTube terms of service, which prohibit data scraping, including the scraping of subtitles.

However, Apple and Anthropic didn’t scrape YouTube. Someone else did. Tech companies are merely reaping the benefits of someone else’s violation. Regardless of whether this is an infringement, it still represents content laundering.

To make matters worse, the scraper did this in the name of “open source” AI. Open source can be a wonderful thing when everyone involved consents to the terms. However, neither the YouTubers involved nor YouTube consented.

This is against the very spirit of open source. Open source is about transparency and collaboration. This is neither. Instead, it’s about AI developers having access to content without having to violate YouTube’s TOS themselves.

Whether or not training AI systems on copyright-protected works is a copyright infringement, there are still other laws at play. Some seem to be all too happy to break them so that AI developers don’t have to.

Bottom Line

Right now, we are in a period of uncertainty when it comes to AI and copyright. It will likely be that way for several years.

Since AI companies are moving full steam ahead as if copyright issues were nonexistent, one of the few ways you can protect your content is to restrict access to it.

But that doesn’t matter if others access and launder the content for AI companies. Whether it’s scraping subtitles for a dataset or uploading copied articles to Reddit, others are feeding AI systems with content they don’t own.

AI companies are perfectly happy with this arrangement. Consent be damned. Though they are paying for licenses from some larger rightsholders, the deals with Reddit and Stack Overflow are about access, not copyright.

AI companies know full well that not all the content on those sites is legally posted. That’s not the point. They are simply using a middle person to launder content and gain access to material they might otherwise not have been able to obtain, whether due to legal or technical limitations.

However, that is the age we are in: if you can’t legally access content, you can get someone else to break the rules on your behalf.
