OpenAI has introduced the o3 and o4-mini AI models, which are state-of-the-art in many respects. Despite these advances, the new models hallucinate, or fabricate information, more frequently than several of OpenAI's previous models.
Hallucinations have long been one of the hardest problems in AI, affecting even the most capable systems available today. Historically, each new model has hallucinated somewhat less than its predecessor, but that trend does not appear to hold for o3 and o4-mini.
Internal testing at OpenAI revealed that the reasoning models o3 and o4-mini exhibit a higher frequency of hallucinations compared to older reasoning models such as o1, o1-mini, and o3-mini, as well as to OpenAI’s non-reasoning models like GPT-4o.
More concerning for OpenAI is that the cause of the increased hallucinations remains unclear. In its technical documentation for o3 and o4-mini, OpenAI acknowledges that further research is needed to understand why hallucinations become more prevalent as reasoning models are scaled up. While o3 and o4-mini excel in certain areas, including coding and math tasks, they make more claims overall, which produces more accurate claims but also more inaccurate, hallucinated ones.
An evaluation using PersonQA, OpenAI's internal benchmark for measuring a model's knowledge about people, showed that o3 hallucinated in response to 33% of questions. That is roughly double the rate of OpenAI's previous reasoning models, o1 and o3-mini, which scored 16% and 14.8%, respectively. The o4-mini model did even worse, hallucinating 48% of the time on PersonQA.
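PersonQA itself is internal to OpenAI, so the exact grading procedure isn't public, but the reported figures are simple ratios. Below is a minimal sketch of how such a rate could be computed once each answer has been graded as hallucinated or not; the `GradedResponse` structure and the handling of abstentions are illustrative assumptions, not OpenAI's actual schema.

```python
from dataclasses import dataclass

@dataclass
class GradedResponse:
    question_id: str
    hallucinated: bool  # a human or automated grader judged the answer factually wrong

def hallucination_rate(responses: list[GradedResponse]) -> float:
    """Fraction of graded benchmark responses judged to contain a hallucination.

    How abstentions ("I don't know") are counted is a grading choice that
    isn't specified here; this toy version simply leaves them out.
    """
    if not responses:
        return 0.0
    return sum(r.hallucinated for r in responses) / len(responses)

# Toy example: 1 hallucinated answer out of 3 graded answers -> ~33%
sample = [
    GradedResponse("q1", False),
    GradedResponse("q2", True),
    GradedResponse("q3", False),
]
print(f"{hallucination_rate(sample):.0%}")  # 33%
```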
Additionally, independent testing by Transluce, an AI research organization, found that o3 sometimes fabricates the actions it undertook to reach certain conclusions. In one incident, o3 asserted it executed code on a 2021 MacBook Pro “outside of ChatGPT” before inserting the results into its answer—despite lacking the capability to perform such actions.
Neil Chowdhury, a researcher at Transluce and former OpenAI staff, theorized in an email to TechCrunch that the type of reinforcement learning applied to the o-series models may exacerbate issues typically alleviated, though not completely resolved, by standard post-training processes. Sarah Schwettmann, co-founder of Transluce, further noted that the hallucination rate of o3 could reduce its utility.
Kian Katanforoosh, an adjunct professor at Stanford and CEO of the upskilling startup Workera, told TechCrunch that his team is already testing o3 in their coding workflows and has found it a step above the competition. However, Katanforoosh noted that o3 often suggests broken website links, ones that don't work when clicked.
While hallucinations may help models arrive at novel ideas and exhibit creative “thinking,” they also pose problems for industries where precision is vital, such as law, where inserting factual errors into client contracts could have serious consequences.
One promising strategy for improving model accuracy is giving models web search capabilities. OpenAI’s GPT-4o with web search, for example, achieves 90% accuracy on SimpleQA, another of OpenAI’s accuracy benchmarks. Search could reduce the hallucination rates of reasoning models as well, at least in cases where users are willing to share their prompts with a third-party search provider.
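OpenAI hasn't detailed exactly how its search-enabled models are wired, but the general pattern, retrieval-augmented prompting, is easy to sketch. The example below assumes a hypothetical `web_search` helper supplied by the application and uses the standard OpenAI chat completions client; the prompt wording and choice of gpt-4o are illustrative, not a description of OpenAI's implementation.

```python
from openai import OpenAI

client = OpenAI()

def web_search(query: str, k: int = 3) -> list[str]:
    """Hypothetical helper: return the top-k text snippets for a query.

    Stand-in for whatever search provider the application uses; not an
    OpenAI API.
    """
    raise NotImplementedError("plug in your search provider here")

def grounded_answer(question: str, model: str = "gpt-4o") -> str:
    # Retrieve snippets and pass them to the model as context, instructing
    # it to answer only from the provided sources.
    snippets = web_search(question)
    context = "\n\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "Answer using only the sources below. "
                           "If the sources don't contain the answer, say so.\n\n" + context,
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```

The design point is simply that grounding answers in retrieved text gives the model something to cite instead of guessing, which is one reason search-augmented setups tend to hallucinate less, at the cost of sending the user's query to a search provider.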
If scaling up reasoning models continues to worsen hallucinations, the hunt for a solution will only become more urgent. OpenAI spokesperson Niko Felix said in an email to TechCrunch that addressing hallucinations across all models is an ongoing area of research, and that the company is continually working to improve their accuracy and reliability.
In the past year, the AI industry has pivoted toward reasoning models after techniques for improving traditional AI models began showing diminishing returns. Reasoning improves model performance on a variety of tasks without requiring massive amounts of computing and data during training. Yet reasoning also appears to come with more hallucination, presenting a new challenge to address.