Microsoft is Investigating a Method to Recognize and Credit Contributors to AI Training Data

Microsoft is launching a research initiative to assess how specific training data influences the text, images, and other media generated by AI models.

A recently resurfaced job listing from December reveals that the company is seeking a research intern for the project, which aims to demonstrate that AI models can be trained to efficiently and meaningfully estimate the impact of particular training data, such as photos and books, on their outputs.

The listing notes that current neural network architectures lack transparency in attributing their sources. It emphasizes the importance of changing this to provide incentives, recognition, and possibly compensation for contributors of valuable data, especially as AI models continue to evolve in unexpected ways.
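The listing doesn't say how such estimates would be computed. One well-studied family of training-data attribution methods scores a training example by how strongly its loss gradient aligns with the gradient for a given output (the idea behind TracIn-style estimators). Below is a minimal, hypothetical PyTorch sketch of that idea; all names are illustrative, and nothing in the listing suggests this is the approach Microsoft's project will take.

```python
# Hypothetical sketch of gradient-similarity attribution (TracIn-style).
# This is an illustration of one published technique, not Microsoft's method.
import torch
import torch.nn as nn

def example_gradient(model: nn.Module, loss_fn, x, y) -> torch.Tensor:
    """Flattened gradient of the loss on a single example."""
    loss = loss_fn(model(x), y)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def influence_score(model, loss_fn, train_example, test_example) -> float:
    """Approximate influence of one training example on one output:
    the dot product of their loss gradients (higher = more influence)."""
    g_train = example_gradient(model, loss_fn, *train_example)
    g_test = example_gradient(model, loss_fn, *test_example)
    return torch.dot(g_train, g_test).item()

if __name__ == "__main__":
    # Toy model and data, purely for demonstration.
    model = nn.Linear(4, 2)
    loss_fn = nn.CrossEntropyLoss()
    train = (torch.randn(1, 4), torch.tensor([0]))
    test = (torch.randn(1, 4), torch.tensor([1]))
    # Rank training items by this score to credit top contributors.
    print(influence_score(model, loss_fn, train, test))
```

Real systems would have to aggregate such scores across many training checkpoints and millions of examples, which is exactly the efficiency challenge the listing highlights.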

The Legal Battle Over AI-Generated Content

AI-generated text, code, images, videos, and music are at the center of multiple intellectual property lawsuits against AI companies. These firms often train their models on vast datasets scraped from public websites, some of which contain copyrighted material. The companies argue that the fair use doctrine protects this data collection; many artists, programmers, and authors disagree.

Microsoft itself is facing at least two copyright-related lawsuits. In December 2023, The New York Times sued Microsoft and its AI partner, OpenAI, alleging that their models were trained on millions of the newspaper’s articles without permission. Separately, software developers have sued Microsoft, claiming its GitHub Copilot AI assistant was unlawfully trained on their copyrighted code.

Microsoft’s latest research initiative, referred to in the job listing as “training-time provenance,” reportedly involves Jaron Lanier, a leading technologist and researcher at Microsoft. In an April 2023 New Yorker op-ed, Lanier discussed “data dignity,” a concept focused on linking digital content to the individuals who created it.

“A data-dignity approach would identify the most unique and influential contributors whenever a large AI model generates a valuable output,” Lanier explained. “For example, if a model creates ‘an animated film of my kids in an oil-painted world of talking cats on an adventure,’ key oil painters, cat portraitists, voice actors, and writers—or their estates—could be recognized as essential to its creation. They would receive acknowledgment, incentives, and potentially even compensation.”

Emerging Compensation Models for AI Training Data

Several companies are already exploring similar ideas. AI developer Bria, which recently secured $40 million in venture funding, claims to compensate data owners based on their “overall influence.” Adobe and Shutterstock also provide payouts to dataset contributors, though the details of these payments remain largely undisclosed.

Most major AI labs, however, have not implemented direct compensation programs for individual contributors, opting instead to secure licensing deals with publishers, platforms, and data brokers. In many cases, they offer copyright holders the ability to “opt out” of future training, though these processes can be cumbersome and do not apply retroactively to models already trained on the data.

Microsoft’s initiative could ultimately remain a proof of concept. OpenAI made a similar promise in May 2024, announcing plans for a tool that would let creators control how their work is used in AI training. Nearly a year later, the tool has yet to materialize and reportedly has not been an internal priority.

Critics suggest Microsoft may be engaging in “ethics washing” to preempt potential regulatory actions or legal rulings that could disrupt its AI business. This effort is particularly notable given the stance of other leading AI labs on fair use. Companies like Google and OpenAI have advocated for weaker copyright protections related to AI training, with OpenAI specifically urging the U.S. government to formally enshrine fair use exemptions for model training to ease legal constraints on developers.


Read the original article on TechCrunch.
