OpenAI has faced accusations from various parties regarding the training of its AI models on copyrighted content without obtaining permission. A recent paper by an AI watchdog organization alleges that the company has increasingly depended on non-public books, which it did not license, to develop more advanced AI models.
AI models function as complex prediction engines. They are trained on extensive data, including books, movies, and TV shows, which allows them to learn patterns and ways of extrapolating from simple prompts. When a model generates an essay on a Greek tragedy or creates Ghibli-style images, it is drawing on its vast training data rather than creating something entirely new.
Several AI labs, including OpenAI, have started utilizing AI-generated data for training as they deplete real-world data sources, primarily from the public web. However, few have completely abandoned real-world data due to the risks associated with relying solely on synthetic data, such as reduced model performance.
The paper, coming from the AI Disclosures Project, a nonprofit organization co-founded in 2024 by media executive Tim O’Reilly and economist Ilan Strauss, suggests that OpenAI might have used paywalled books from O’Reilly Media to train its GPT-4o model. Notably, O’Reilly is the CEO of O’Reilly Media, and there is no licensing agreement between O’Reilly and OpenAI, according to the paper.
GPT-4o is the default model in ChatGPT. The paper's co-authors assert that GPT-4o shows markedly stronger recognition of paywalled O'Reilly book content than OpenAI's earlier GPT-3.5 Turbo model, while GPT-3.5 Turbo shows comparatively greater recognition of publicly accessible O'Reilly book samples.
To identify copyrighted content in language models' training data, the paper employed a method called DE-COP, first introduced in an academic paper in 2024. Known as a "membership inference attack," the technique tests whether a model can reliably distinguish human-authored texts from AI-paraphrased versions of the same text. If it can, the model likely has prior knowledge of that text from its training data.
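In simplified form, a DE-COP trial is a multiple-choice quiz: the model is shown the verbatim passage alongside paraphrases and asked to pick the original, and picking correctly far more often than chance suggests memorization. The sketch below is illustrative only (the function name and trial data are hypothetical, not from the paper) and scores the quiz against the chance baseline.

```python
def decop_guess_rate(trials):
    """Fraction of multiple-choice trials in which the model picked
    the verbatim (human-authored) passage over AI paraphrases.

    Each trial is a tuple: (index of the true passage among the
    options, index the model actually chose).
    """
    correct = sum(1 for true_idx, picked in trials if true_idx == picked)
    return correct / len(trials)

# Hypothetical trial results with four options per question.
trials = [(0, 0), (2, 2), (1, 3), (3, 3), (1, 1), (0, 2), (2, 2), (3, 3)]

rate = decop_guess_rate(trials)
chance = 1 / 4  # four answer options, so random guessing scores 25%
print(f"guess rate: {rate:.2f} vs chance baseline {chance:.2f}")
```

A guess rate well above the chance baseline is the signal the method looks for: the model distinguishes the original wording too reliably to be guessing, which points to the passage having appeared in its training data.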
The co-authors, O'Reilly, Strauss, and AI researcher Sruly Rosenblat, examined several OpenAI models, including GPT-4o and GPT-3.5 Turbo, for knowledge of O'Reilly Media books published before and after each model's training cutoff date. Using 13,962 paragraph excerpts from 34 O'Reilly books, they estimated the probability that a given excerpt had been included in a model's training dataset.
According to the paper, GPT-4o recognized substantially more paywalled O'Reilly book content than older models such as GPT-3.5 Turbo, even after controlling for confounders like newer models' generally improved ability to tell human-authored text from AI-generated text.
The co-authors caution that their findings do not provide definitive proof. They acknowledge the possibility that OpenAI may have obtained the paywalled book excerpts through users inputting them into ChatGPT.
Further complicating the picture, the study did not evaluate OpenAI's latest models, including GPT-4.5 and reasoning models such as o3-mini and o1. It is therefore unknown whether these models were trained on paywalled O'Reilly book data, or on less of it than GPT-4o was.
OpenAI has long advocated for looser restrictions on training models with copyrighted data. The company has also sought out higher-quality training data, even hiring journalists to help fine-tune its models' outputs, part of a broader industry trend of enlisting experts from various fields to feed their knowledge into AI systems.
OpenAI reportedly pays for some of its training data, having secured licensing agreements with news publishers, social networks, and stock media libraries, among others. The company also provides opt-out mechanisms, though imperfect, allowing copyright holders to request exclusion from training datasets.
Despite these measures, OpenAI is currently involved in several lawsuits related to its training data practices and copyright law adherence in U.S. courts. The O’Reilly paper does not present a favorable image for the company.
OpenAI declined to comment on the situation.