Researchers Suggest OpenAI Trained its Models on Paywalled O’Reilly Books

OpenAI has faced multiple accusations of using copyrighted content without permission to train its AI models. A new paper from the AI Disclosures Project, an organization focused on AI transparency, makes a serious claim that OpenAI has increasingly relied on non-public, unlicensed books to train its advanced AI models.

AI models are essentially sophisticated prediction engines. Trained on vast amounts of data, such as books, movies, and TV shows, they learn patterns and use them to generate responses to prompts. When a model “writes” an essay or “draws” an image, it is drawing on its training to approximate, rather than creating something entirely new.

While many AI labs, including OpenAI, have turned to AI-generated data to train models as they run out of real-world data, few have abandoned real-world sources altogether. Training exclusively on synthetic data could harm the model’s performance.

AI Disclosures Project Suggests OpenAI Used Paywalled O’Reilly Books for GPT-4o Training

The AI Disclosures Project, a nonprofit founded by media mogul Tim O’Reilly and economist Ilan Strauss, suggests in its paper that OpenAI likely used paywalled books from O’Reilly Media to train its GPT-4o model. O’Reilly Media, led by Tim O’Reilly, does not have a licensing agreement with OpenAI, according to the paper.

The co-authors of the paper noted, “GPT-4o, OpenAI’s more advanced and capable model, shows a strong recognition of paywalled O’Reilly book content, especially when compared to the earlier GPT-3.5 Turbo model.” They added, “In contrast, GPT-3.5 Turbo shows greater recognition of publicly available O’Reilly book samples.”

The paper used a method called DE-COP, first introduced in a 2024 academic study, that is designed to detect copyrighted content in language models’ training data. Known as a “membership inference attack,” the method tests whether a model can reliably distinguish human-authored texts from AI-generated paraphrases of the same text. If it can, that suggests the model may have encountered the text during training.
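To make the idea concrete, here is a minimal sketch of the kind of multiple-choice test described above. It is not the paper's code: the function names, the `pick` callback (a stand-in for a call to the model being tested), and the number of paraphrases per passage are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple

# One quiz item: the verbatim passage plus machine-written paraphrases of it.
Item = Tuple[str, Sequence[str]]


def build_quiz(original: str, paraphrases: Sequence[str],
               rng: random.Random) -> Tuple[List[str], int]:
    """Shuffle the passage in among its paraphrases.

    Returns the shuffled options and the index of the verbatim passage.
    """
    texts = [original, *paraphrases]
    order = list(range(len(texts)))  # index 0 is the verbatim passage
    rng.shuffle(order)
    return [texts[i] for i in order], order.index(0)


def recognition_rate(items: Sequence[Item],
                     pick: Callable[[List[str]], int],
                     seed: int = 0) -> float:
    """Fraction of quizzes in which `pick` selects the verbatim passage.

    `pick` is a hypothetical wrapper around the model under test: it
    receives the options and returns the index it believes is the original.
    """
    rng = random.Random(seed)
    hits = 0
    for original, paraphrases in items:
        options, answer = build_quiz(original, paraphrases, rng)
        if pick(options) == answer:
            hits += 1
    return hits / len(items)


def chance_level(num_paraphrases: int) -> float:
    """Expected score of a model that guesses uniformly at random."""
    return 1.0 / (num_paraphrases + 1)
```

In this sketch, a recognition rate well above `chance_level` on a set of passages would hint that the model has seen them before, while a rate near chance would not. Any real test would also need to control for the model simply being good at spotting human-written text.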

Co-authors Analyze OpenAI Models’ Knowledge of O’Reilly Media Books

The paper’s co-authors, O’Reilly, Strauss, and AI researcher Sruly Rosenblat, probed how well GPT-4o, GPT-3.5 Turbo, and other OpenAI models recognized O’Reilly Media books published both before and after the models’ training cutoff dates. They used 13,962 paragraph excerpts from 34 O’Reilly books to estimate the likelihood that a specific excerpt was included in the training data.

The results showed that GPT-4o recognized far more paywalled O’Reilly book content than older models did, particularly GPT-3.5 Turbo. This held even after accounting for potential confounding factors, such as newer models’ improved ability to tell whether a text was written by a human.
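Comparing books published before a model's training cutoff with books published after it is what lets a test like this separate memorization from general skill, since a model cannot have trained on books that did not yet exist. The article does not say which statistic the authors used to summarize that comparison. One common choice for this kind of two-group comparison is the area under the ROC curve (AUROC), sketched below as an assumption. The function name and the idea of per-excerpt recognition scores are illustrative only.

```python
from typing import Sequence


def auroc(before_cutoff: Sequence[float], after_cutoff: Sequence[float]) -> float:
    """Probability that a randomly chosen pre-cutoff excerpt scores higher
    than a randomly chosen post-cutoff excerpt (ties count as half).

    0.5 means the two groups are indistinguishable; values approaching 1.0
    mean pre-cutoff excerpts are recognized much more readily.
    """
    if not before_cutoff or not after_cutoff:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for b in before_cutoff:
        for a in after_cutoff:
            if b > a:
                wins += 1.0
            elif b == a:
                wins += 0.5
    return wins / (len(before_cutoff) * len(after_cutoff))
```

Because both groups are scored by the same model, a general talent for spotting human-written text would lift both groups' scores and leave this value near 0.5. Only a gap between the groups pushes it upward.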

The co-authors concluded, “GPT-4o likely recognizes, and thus has prior knowledge of, many non-public O’Reilly books published before its training cutoff date.”

The co-authors are quick to clarify that their findings are not definitive evidence. They acknowledge that their experimental method isn’t foolproof and that OpenAI could have gathered paywalled book excerpts from users copying and pasting them into ChatGPT.

Co-authors Did Not Evaluate OpenAI’s Latest Models

Complicating matters further, the co-authors did not assess OpenAI’s latest models, including GPT-4.5 and “reasoning” models like o3-mini and o1. It’s possible that these newer models were not trained on paywalled O’Reilly books, or were trained on a smaller portion of such data compared to GPT-4o.

That said, it is well known that OpenAI has been seeking higher-quality training data while advocating for looser restrictions on training with copyrighted content. The company has even hired journalists to help refine its models’ outputs. The same trend is visible across the AI industry, with companies recruiting experts in fields like science and physics to incorporate their knowledge into AI systems.

It’s important to note that OpenAI does pay for at least some of its training data, with licensing agreements in place with news publishers, social networks, stock media libraries, and others. The company also offers opt-out mechanisms, albeit imperfect ones, that allow copyright holders to flag content they do not want used for training.

Nevertheless, as OpenAI fights several lawsuits in U.S. courts over its training data practices and its treatment of copyright law, the O’Reilly paper puts the company’s approach under further scrutiny.


Read the original article on: TechCrunch
