OpenAI’s New AI Models Are Hallucinating More Than Expected

OpenAI’s newly released o3 and o4-mini AI models push the boundaries of current technology in many ways. Yet despite these advances, the models hallucinate (fabricate incorrect or fictional information) more often than some of the company’s older models.
Hallucinations Remain a Persistent Challenge
Hallucinations remain one of the most persistent challenges in artificial intelligence, even for the most cutting-edge systems. Historically, each new generation of models has improved slightly in this area, hallucinating less than its predecessors. But that trend appears to have reversed with o3 and o4-mini.
According to OpenAI’s internal evaluations, these “reasoning” models hallucinate more frequently than earlier reasoning models like o1, o1-mini, and o3-mini, as well as traditional non-reasoning models like GPT-4o.
More concerning, OpenAI itself doesn’t fully understand why this is happening.
In the technical documentation for o3 and o4-mini, the company admits that “more research is needed” to determine why hallucination rates rise as reasoning capabilities scale up. While the models perform better on certain tasks, such as coding and math, they also make more claims overall, producing both more correct and more incorrect statements. The arithmetic is simple: a model that makes twice as many claims at the same error rate also produces twice as many hallucinations.
Benchmarking Shows Significant Hallucination Rates
On PersonQA, OpenAI’s internal benchmark for evaluating a model’s knowledge about people, o3 hallucinated in 33% of its responses, more than double the rate of o1 (16%) and o3-mini (14.8%). The o4-mini model performed even worse, hallucinating in 48% of responses.
Independent testing from nonprofit AI research lab Transluce also flagged issues. In one case, the o3 model claimed it had run code on a 2021 MacBook Pro “outside of ChatGPT,” and used those results in its answer — something the model is not actually capable of doing.
Neil Chowdhury, a Transluce researcher and former OpenAI employee, suggested that the reinforcement learning approach used for the o-series might amplify issues typically reduced during the final training phases. Transluce co-founder Sarah Schwettmann also noted that the high hallucination rate could limit the model’s usefulness.
Kian Katanforoosh, a Stanford adjunct professor and CEO of the upskilling platform Workera, told TechCrunch that his team is already testing o3 in their coding workflows. While the model performs well in general, they’ve noticed it often hallucinates broken links — providing URLs that don’t actually work.
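For teams in a similar position, one practical guard is to check every link a model emits before passing it along. Below is a minimal sketch of such a check in Python; extract_urls and verify_urls are hypothetical helper names (not part of any OpenAI tooling), and it assumes the third-party requests library is installed.

```python
import re
import requests  # third-party; install with: pip install requests

def extract_urls(text: str) -> list[str]:
    # Naive URL pattern; good enough for a sanity check, not a full parser.
    return re.findall(r"https?://[^\s)\"'>\]]+", text)

def verify_urls(text: str, timeout: float = 5.0) -> dict[str, bool]:
    """Map each URL found in `text` to whether it responded without an error status."""
    results: dict[str, bool] = {}
    for url in extract_urls(text):
        try:
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
            results[url] = resp.status_code < 400
        except requests.RequestException:
            results[url] = False  # DNS failure, timeout, refused connection, etc.
    return results

if __name__ == "__main__":
    model_output = "Docs: https://platform.openai.com/docs and https://example.com/no-such-page"
    for url, ok in verify_urls(model_output).items():
        print(("OK  " if ok else "DEAD") + "  " + url)
```

A check like this catches dead links, though not links that resolve but point to the wrong content.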
Hallucinations Undermine Trust in Accuracy-Critical Fields
Although hallucinations can sometimes lead to creative or novel ideas, they pose a significant problem in fields where precision is non-negotiable. For example, a law firm would be unlikely to adopt a model that risks inserting factual errors into contracts.
One promising method for improving model accuracy is integrating web search capabilities. OpenAI’s GPT-4o with search access, for instance, achieves 90% accuracy on SimpleQA, one of the company’s internal benchmarks. In theory, giving reasoning models access to real-time search could help reduce hallucinations — assuming users are comfortable with third-party access to their prompts.
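To make that idea concrete, here is a minimal sketch of search-grounded prompting, assuming the official openai Python SDK (v1+); search_web is a hypothetical stand-in for a real search backend, and nothing here describes OpenAI’s actual search integration.

```python
from openai import OpenAI  # official openai Python SDK, v1+

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def search_web(query: str) -> list[str]:
    # Hypothetical stand-in for a real search backend; replace with an
    # actual search API call. Hardcoded so the sketch is self-contained.
    return [f"(placeholder snippet relevant to: {query})"]

def grounded_answer(question: str) -> str:
    # Retrieve snippets first, then have the model answer with them in
    # context, so its claims can be checked against the retrieved text.
    context = "\n\n".join(search_web(question))
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "If the context is insufficient, say you don't know."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(grounded_answer("Who founded Transluce?"))
```

The design point is that grounding narrows what the model is allowed to assert, trading some fluency for verifiability.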
If scaling reasoning models continues to worsen hallucination rates, the pressure to find a solution will only grow.
“Reducing hallucinations across all of our models is an ongoing research priority, and we’re constantly working to enhance their accuracy and reliability,” said OpenAI spokesperson Niko Felix in an email to TechCrunch.
Over the past year, the AI industry has shifted focus toward reasoning models, especially as improvements in traditional models began yielding diminishing returns. Reasoning offers performance boosts across many tasks without the need for enormous data or computing power. However, it now appears that this path brings a new challenge: a higher tendency to hallucinate.
Read the original article on: TechCrunch