Digital Inbreeding Could Cause AI Systems to Collapse

Artificial intelligence (AI) prophets and newsmongers are forecasting the end of the generative AI hype, with talk of an impending catastrophic “model collapse.”

But how realistic are these predictions? And what is model collapse anyway?

“Model collapse,” a term coined in 2023 research that has been gaining attention recently, describes a hypothetical scenario in which AI systems become progressively less capable as AI-generated data makes up more and more of the internet.

Modern AI systems are built with machine learning: programmers set up the underlying mathematical framework, but the actual “intelligence” comes from training the system to recognize patterns in data.
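To make that division of labor concrete, here is a minimal sketch (my illustration, using plain linear regression rather than a neural network): the programmer fixes the mathematical form of the model, and a training loop extracts the parameters from the data.

```python
import numpy as np

rng = np.random.default_rng(1)

# The programmer fixes the framework: a linear model y = w * x + b.
w, b = 0.0, 0.0

# The "intelligence" comes from data: noisy samples of y = 3x + 2.
x = rng.uniform(-1, 1, size=200)
y = 3 * x + 2 + rng.normal(scale=0.1, size=200)

lr = 0.1
for _ in range(500):
    err = (w * x + b) - y
    # Gradient descent on mean squared error nudges the parameters
    # toward whatever pattern the data contains.
    w -= lr * 2 * (err * x).mean()
    b -= lr * 2 * err.mean()

print(f"learned w = {w:.2f}, b = {b:.2f}")  # close to the true 3 and 2
```

Generative AI systems work on the same principle, just with neural networks holding billions of parameters instead of two.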

However, these generative AI systems require vast amounts of high-quality data. Major tech companies like OpenAI, Google, Meta, and Nvidia continuously collect terabytes of content from the internet to train their models. Since the rise of generative AI in 2022, there’s been an increase in AI-generated content online.

Exploring AI-Generated Data for Training Models

In 2023, researchers began exploring whether AI-generated data alone could be used for training, rather than relying on human-generated data. This approach has significant advantages: AI-made content is cheaper to obtain than human data and raises fewer ethical and legal concerns when collected.

Yet researchers discovered that training AI solely on AI-generated data leads to a decline in performance. As each model learns from the outputs of earlier models, a “regurgitative training” effect sets in, reducing both the quality and diversity of AI outputs. Quality here means the AI’s usefulness, safety, and honesty, while diversity refers to the range of responses and the representation of different cultural and social perspectives.
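The effect is easy to reproduce in miniature. In the toy simulation below (a sketch of the general idea, not the actual experiments from the research), “training” is reduced to fitting a one-dimensional Gaussian, and each generation is trained only on samples drawn from the previous generation’s fitted model. The fitted spread drifts and tends to shrink across generations, which is the loss of diversity described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data drawn from the true distribution.
data = rng.normal(loc=0.0, scale=1.0, size=1_000)

for generation in range(10):
    # "Train" a model: here, simply fit a Gaussian to the current data.
    mu, sigma = data.mean(), data.std()
    print(f"gen {generation}: mu = {mu:+.3f}, sigma = {sigma:.3f}")
    # The next generation trains only on the previous model's output,
    # i.e. purely synthetic data with no fresh human input.
    data = rng.normal(loc=mu, scale=sigma, size=1_000)
```

Each refit compounds the estimation error of the one before it, so the model drifts away from the original distribution instead of staying anchored to it.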

In summary, excessive use of AI systems may be contaminating the data sources essential for their effectiveness.

Can big tech simply filter out AI-generated content? Not really. Tech companies already invest significant time and resources into cleaning and filtering the data they collect, with some discarding up to 90% of the initial data used for training models.

As the need to exclude AI-generated content grows, these efforts will become even more challenging. Moreover, distinguishing AI-generated content will become increasingly difficult over time, making the process of filtering out synthetic data less financially viable.
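In pipeline terms, that filtering step looks something like the sketch below. Everything here is illustrative: `looks_synthetic` stands in for a hypothetical AI-content detector, and real cleaning pipelines layer deduplication, quality scoring, and source allowlists on top of anything like this.

```python
from typing import Callable, Iterable

def filter_corpus(
    documents: Iterable[str],
    looks_synthetic: Callable[[str], float],  # hypothetical detector: returns P(AI-generated)
    threshold: float = 0.5,
) -> list[str]:
    """Keep only documents the detector scores as likely human-written.

    Lowering the threshold discards more synthetic text but also more
    genuine human text; raising it lets more synthetic text through.
    As detectors become less reliable, no threshold works well, which
    is the economic problem described above.
    """
    return [doc for doc in documents if looks_synthetic(doc) < threshold]
```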

Ultimately, research indicates that human data remains essential, as it is the foundation of AI’s “intelligence.”

Challenges in Acquiring High-Quality Data

There are signs that developers are increasingly struggling to obtain high-quality data. For example, the documentation for the GPT-4 release noted an unusually large number of staff dedicated to handling data-related aspects of the project.

We might also be running out of new human-generated data, with some estimates suggesting the supply could be exhausted by 2026.

This may explain why OpenAI and other companies are forming exclusive partnerships with major players like Shutterstock, Associated Press, and News Corp, which own extensive collections of proprietary human data not available on the public internet.

However, the risk of a catastrophic model collapse may be exaggerated. Most research focuses on scenarios where synthetic data completely replaces human data, but in reality, human and AI-generated data are likely to grow alongside each other, mitigating the risk of collapse.

A more probable future scenario involves a diverse range of generative AI platforms creating and publishing content, rather than a single dominant model. This diversity enhances resilience against collapse.

This underscores the importance of regulators promoting healthy competition by curbing monopolies in the AI industry and supporting the development of public interest technologies.

There are also more subtle dangers associated with an overabundance of AI-generated content.

An excess of synthetic content might not endanger the progress of AI development, but it does threaten the digital public good of the human internet.

Impact of AI Assistance

For example, researchers observed a 16% decline in activity on the coding Q&A site Stack Overflow in the year after ChatGPT’s release, suggesting that AI assistance may already be reducing direct interactions in some online communities.

Distinguishing Human from AI-Generated Content

It’s becoming increasingly hard to tell human-generated content from AI-generated content. One solution could be watermarking or labeling AI-generated content, a concept recently supported by Australian government interim legislation and discussed by many experts.
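One widely studied watermarking approach biases text generation toward a pseudo-random “green” subset of the vocabulary at each step, derived from the preceding token, so the bias can later be detected statistically. The sketch below shows only the detection side and simplifies the published schemes considerably; it is not the specific mechanism contemplated by the Australian proposal.

```python
import hashlib
import random

def green_list(prev_token: str, vocab: list[str], fraction: float = 0.5) -> set[str]:
    # Seed a PRNG from the previous token, so anyone who knows the
    # scheme can reconstruct the same "green" subset of the vocabulary.
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    k = int(len(vocab) * fraction)
    return set(random.Random(seed).sample(vocab, k))

def watermark_score(tokens: list[str], vocab: list[str]) -> float:
    # Fraction of tokens that fall in their green list: around 0.5 for
    # ordinary human text, noticeably higher for watermarked output,
    # because the generator up-weighted green tokens at sampling time.
    hits = sum(tokens[i] in green_list(tokens[i - 1], vocab) for i in range(1, len(tokens)))
    return hits / max(len(tokens) - 1, 1)
```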

Additionally, the growing uniformity of AI-generated content risks diminishing socio-cultural diversity, potentially leading to cultural erasure for some groups. There is an urgent need for cross-disciplinary research to address the social and cultural implications of AI systems.

Protecting human interactions and data is crucial, both for our own well-being and to potentially mitigate the risk of future model collapse.


Read the original article on: Science Alert

Read more: Could Artificial Intelligence Bring About the End of Civilization?
