Popular YouTuber Marques Brownlee, screenshot via YouTube
AI is supposed to be the miracle technology of our age, but I still can’t get past the fact that it seems to need an extensive amount of “training” on the intellectual property and copyrighted works of humans. It’s like Pac-Man eating up all the books, music, and art in its path. And now add to that illustrious group: YouTube videos. You see, a nonprofit called EleutherAI believes that the development of AI, though expensive, should not be controlled solely by Big Tech. So in 2020 they released a dataset called The Pile, which they describe as “a large-scale corpus for training language models, composed of 22 smaller sources,” and it’s free to download. Basically, EleutherAI has created an AI training dataset for the masses. So naturally, deep-pocketed companies like Apple help themselves to it too. It is through the Pile that Apple and others have fed YouTube videos into their own AI models, and the video creators are not happy:

The Pile was not intended for Big Tech, but here we are: AI models at Apple, Salesforce, Anthropic, and other major technology players were trained on tens of thousands of YouTube videos without the creators’ consent and potentially in violation of YouTube’s terms, according to a new report appearing in both Proof News and Wired. The companies trained their models in part by using “the Pile,” a collection by nonprofit EleutherAI that was put together as a way to offer a useful dataset to individuals or companies that don’t have the resources to compete with Big Tech, though it has also since been used by those bigger companies.

Creators are seeing red, but it’s a legal gray area: The Pile includes books, Wikipedia articles, and much more. It also includes YouTube captions collected via YouTube’s captions API, scraped from 173,536 YouTube videos across more than 48,000 channels, among them videos from big YouTubers like MrBeast, PewDiePie, and popular tech commentator Marques Brownlee. On X, Brownlee called out Apple’s use of the dataset, but acknowledged that assigning blame is complicated when Apple did not collect the data itself. He wrote: “Apple has sourced data for their AI from several companies. One of them scraped tons of data/transcripts from YouTube videos, including mine. Apple technically avoids ‘fault’ here because they’re not the ones scraping. But this is going to be an evolving problem for a long time.”

Gotta love geek humor: Coincidentally, one of the videos used in the dataset was an Ars Technica short film whose joke was that it was already written by AI. Proof News also notes that the dataset includes videos of a parrot, so AI models are now parroting a parrot that was parroting human speech, as well as parroting other AIs that were parroting humans. As AI-generated content continues to proliferate on the internet, it will become increasingly difficult to assemble training datasets that don’t already include AI-produced content.

Is it fair use? The Pile is widely used and referenced in AI circles, and tech companies are known to have trained on it before. It has been cited in multiple lawsuits brought by intellectual property owners against AI and tech companies. Defendants in those suits, including OpenAI, argue that this kind of scraping is fair use. None of the lawsuits has yet been resolved in court.

The Pile is a ‘robust data collection’ of intellectual property: However, Proof News did some digging to identify specifics about the use of YouTube captions, and went so far as to build a tool you can use to search the Pile for individual videos or channels. The work exposes just how robust the data collection is, and how little control owners of intellectual property have over how their work is used once it’s on the open web. Note, however, that it isn’t necessarily the case that this data was used to train models that produce competing content for end users.

[From Ars Technica]
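About that Proof News lookup tool: theirs is a web page, but if you’ve downloaded a chunk of the Pile yourself, a rough equivalent is a few lines of Python. This is a minimal sketch, assuming the Pile’s standard distribution format (JSON Lines, one document per line, with a “meta” field whose “pile_set_name” marks the YouTube subtitles component); the filename and search term here are hypothetical, purely for illustration.

```python
import json

QUERY = "marques brownlee"      # hypothetical search term
SHARD = "pile_shard_00.jsonl"   # hypothetical local filename for one Pile shard

# Scan one Pile shard for YouTube-subtitle documents mentioning the query.
# Each line is a JSON object: {"text": ..., "meta": {"pile_set_name": ...}}.
with open(SHARD, encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        if doc.get("meta", {}).get("pile_set_name") != "YoutubeSubtitles":
            continue  # skip books, Wikipedia, and the other components
        if QUERY in doc["text"].lower():
            # Print a short preview of the matching caption document
            print(doc["text"][:200].replace("\n", " "), "...")
```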

Whelp, this is social media all over again, right? The technology is moving faster than the public can fully understand its consequences, and certainly faster than laws can regulate it. There are so many layers to this that my brain broke thinking about all the laws that need to be written. Were the videos used to train AI to make competitive material, or purely for “educational” purposes? Does that distinction even matter when it comes to intellectual property? A lot of this will come down to YouTube’s terms of service, and how courts interpret the protections for creator content outlined there. But my biggest question is this: let’s say all the legal issues are ironed out and creators’ works are protected (I know, ha)… then what was all this (alleged) theft for? I’m genuinely hoping for an answer beyond “they fed the AI the material just because they could.”

Photos via YouTube/Marques Brownlee
