Generative AI is a data hog.
The algorithms behind chatbots like ChatGPT learn to create human-like content by scraping terabytes of online articles, Reddit posts, TikTok captions, or YouTube comments. They find intricate patterns in the text, then spit out search summaries, articles, images, and other content.
For the models to become more sophisticated, they need to capture new content. But as more people use them to generate text and then post the results online, it’s inevitable that the algorithms will start to learn from their own output, now littered across the internet. That’s a problem.
A study in Nature this week found a text-based generative AI algorithm, when heavily trained on AI-generated content, produces utter nonsense after just a few cycles of training.
“The proliferation of AI-generated content online could be devastating to the models themselves,” wrote Dr. Emily Wenger at Duke University, who was not involved in the study.
Although the study focused on text, the results could also impact multimodal AI models. These models also rely on training data scraped online to produce text, images, or videos.
As the usage of generative AI spreads, the problem will only get worse.
The eventual end could be model collapse, where AI increasingly fed data generated by AI is overwhelmed by noise and only produces incoherent baloney.
Hallucinations or Breakdown?
It’s no secret generative AI often “hallucinates.” Given a prompt, it can spout inaccurate facts or “dream up” categorically untrue answers. Hallucinations could have serious consequences, such as a healthcare AI incorrectly, but authoritatively, identifying a scab as cancer.
Model collapse is a separate phenomenon, where AI trained on its own self-generated data degrades over generations. It’s a bit like genetic inbreeding, where offspring have a greater chance of inheriting diseases. While computer scientists have long been aware of the problem, how and why it happens for large AI models has been a mystery.
In the new study, researchers built a custom large language model and trained it on Wikipedia entries. They then fine-tuned the model nine times using datasets generated from its own output and measured the quality of the AI’s output with a so-called “perplexity score.” True to its name, the higher the score, the more bewildering the generated text.
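To make the perplexity score concrete, here is a minimal sketch of the standard textbook definition: the exponential of the average negative log-probability a model assigns to each token. The function name and the probability values are illustrative assumptions, not taken from the study, whose exact evaluation setup is not described here.

```python
import math

def perplexity(token_probs):
    """Standard perplexity: exp of the mean negative log-probability per token.

    token_probs holds the probability a language model assigned to each
    token that actually appeared in the text (hypothetical values below).
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

# A model that finds every token fairly likely scores low...
confident = [0.5, 0.5, 0.5, 0.5]
# ...while a model that is surprised by every token scores high.
surprised = [0.1, 0.1, 0.1, 0.1]

print(perplexity(confident))  # close to 2
print(perplexity(surprised))  # close to 10
```

Loosely, a perplexity near 10 means the model was about as unsure as if it were picking among 10 equally likely words at every step.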
Within just a few cycles, the AI notably deteriorated.
In one example, the team gave it a long prompt about the history of building churches, one that would make most humans’ eyes glaze over. After the first two iterations, the AI spewed out a relatively coherent response discussing revival architecture, with an occasional “@” slipped in. By the fifth generation, however, the text completely shifted away from the original topic to a discussion of language translations.
The output of the ninth and final generation was laughably bizarre:
“architecture. In addition to being home to some of the world’s largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-.”
Interestingly, AI trained on self-generated data often ends up producing repetitive phrases, explained the team. Trying to push the AI away from repetition made the AI’s performance even worse. The results held up in multiple tests using different prompts, suggesting it’s a problem inherent to the training procedure, rather than the language of the prompt.
Circular Training
The AI eventually broke down, in part because it gradually “forgot” bits of its training data from generation to generation.
This happens to us too. Our brains eventually wipe away memories. But we experience the world and gather new inputs. “Forgetting” is highly problematic for AI, which can only learn from the internet.
Say an AI “sees” golden retrievers, French bulldogs, and petit basset griffon Vendéens—a far more exotic dog breed—in its original training data. When asked to make a portrait of a dog, the AI would likely skew towards one that looks like a golden retriever because of an abundance of photos online. And if subsequent models are trained on this AI-generated dataset with an overrepresentation of golden retrievers, they eventually “forget” the less popular dog breeds.
“Although a world overpopulated with golden retrievers doesn’t sound too bad, consider how this problem generalizes to the text-generation models,” wrote Wenger.
Previous AI-generated text already swerves towards well-known concepts, phrases, and tones, compared to other less common ideas and styles of writing. Newer algorithms trained on this data would exacerbate the bias, potentially leading to model collapse.
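The "forgetting" in the dog breed example can be imitated with a toy simulation. This is a sketch under loud assumptions, not the study's method: the "model" here is nothing more than the breed frequencies in its training data, "generating" means resampling from those frequencies, and the breed counts, sample size, and seed are made up for illustration.

```python
import random
from collections import Counter

def simulate_generations(initial_counts, generations, sample_size, seed=0):
    """Repeatedly 'train' on data and 'generate' the next dataset from it.

    Toy assumption: training just memorizes breed frequencies, and
    generating draws sample_size items from those frequencies.
    """
    rng = random.Random(seed)
    data = [breed for breed, n in initial_counts.items() for _ in range(n)]
    history = [Counter(data)]
    for _ in range(generations):
        # Each new generation sees only what the previous one produced.
        data = rng.choices(data, k=sample_size)
        history.append(Counter(data))
    return history

# Hypothetical starting mix: one very common breed, two rarer ones.
initial = {"golden retriever": 90, "French bulldog": 9, "PBGV": 1}
history = simulate_generations(initial, generations=9, sample_size=100)

for gen, counts in enumerate(history):
    print(gen, dict(counts))
```

The key property is that a breed missing from one generation's output can never reappear in a later one, because nothing in the loop reintroduces it. Rare breeds are the most likely to hit zero by chance, so diversity can only shrink over time.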
The problem is also a challenge for AI fairness across the globe. Because AI trained on self-generated data overlooks the “uncommon,” it also fails to gauge the complexity and nuances of our world. The thoughts and beliefs of minority populations could be less represented, especially for those speaking underrepresented languages.
“Ensuring that LLMs [large language models] can model them is essential to obtaining fair predictions—which will become more important as generative AI models become more prevalent in everyday life,” wrote Wenger.
How to fix this? One way is to use watermarks—digital signatures embedded in AI-generated data—to help people detect and potentially remove the data from training datasets. Google, Meta, and OpenAI have all proposed the idea, though it remains to be seen if they can agree on a single protocol. But watermarking is not a panacea: Other companies or people may choose not to watermark AI-generated outputs or, more likely, can’t be bothered.
Another potential solution is to tweak how we train AI models. The team found that adding more human-generated data over generations of training produced a more coherent AI.
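A toy sketch of that idea, again under stated assumptions rather than as the team's procedure: each generation's training set mixes resampled "AI" output with a fixed share of freshly drawn human data. The `run` function, the `human_fraction` parameter, the 10 percent value, and the breed counts are all hypothetical choices for illustration.

```python
import random
from collections import Counter

def run(human_data, generations, size, human_fraction, seed=0):
    """Return breed counts after several generations of recursive training.

    Toy assumption: the 'model' is the frequency table of its training set.
    human_fraction is the share of each new training set drawn from the
    original human data instead of from the previous generation's output.
    """
    rng = random.Random(seed)
    n_human = round(size * human_fraction)
    data = list(human_data)
    for _ in range(generations):
        synthetic = rng.choices(data, k=size - n_human)  # model output
        fresh = rng.choices(human_data, k=n_human)       # human-made data
        data = synthetic + fresh
    return Counter(data)

# Hypothetical human dataset with one common and two rarer breeds.
human = ["golden retriever"] * 90 + ["French bulldog"] * 9 + ["PBGV"] * 1

print(run(human, generations=50, size=100, human_fraction=0.0))
print(run(human, generations=50, size=100, human_fraction=0.1))
```

With `human_fraction` at zero, a breed that drops out is gone for good. With any positive share, every generation has a chance to pick the rarer breeds up again from the human data, which is the intuition behind keeping original content in the training mix.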
All this is not to say model collapse is imminent. The study only looked at a text-generating AI trained on its own output. Whether it would also collapse when trained on data generated by other AI models remains to be seen. And with AI increasingly tapping into images, sounds, and videos, it’s still unclear if the same phenomenon appears in those models too.
But the results suggest there’s a “first-mover” advantage in AI. Companies that scraped the internet earlier—before it was polluted by AI-generated content—have the upper hand.
There’s no denying generative AI is changing the world. But the study suggests models can’t be sustained or grow over time without original output from human minds, even if it’s memes or grammatically challenged comments. Model collapse is about more than a single company or country.
What’s needed now is community-wide coordination to mark AI-created data, and openly share the information, wrote the team. “Otherwise, it may become increasingly difficult to train newer versions of LLMs [large language models] without access to data that were crawled from the internet before the mass adoption of the technology or direct access to data generated by humans at scale.”
Image Credit: Kadumago / Wikimedia Commons
* This article was originally published at Singularity Hub