Essay·May 2026

Are AI hallucinations fixed yet?

What got better by 2026, what did not, and the setups that cut made-up answers.

By the benchr team · Reviewed May 30, 2026 · Figures verified against official sources, 30 May 2026

The honest answer is no. The interesting answer is where the problem went.

Two numbers tell the whole story, and they point in opposite directions. On one test, the best model in 2026 makes things up less than 2% of the time. On another test from the same year, a brand-new reasoning model made things up almost half the time. Both numbers are real. The difference is the task, and that difference is the thing most "hallucinations are basically solved" takes quietly skip over.

The number that looks like a win

Vectara keeps a public leaderboard that measures how often a model invents something when it summarizes a document you give it. As of the May 11, 2026 update, the leader is Ant Group's finix_s1_32b at a 1.8% hallucination rate. Behind it, OpenAI's gpt-5.4-nano sits at 3.1% and Google's gemini-2.5-flash-lite at 3.3%. On the older, shorter-document version of the same test, Gemini-2.0-Flash once hit 0.7%. Those are great scores. They are also the answer to a narrow question.

Here's the catch you have to hold onto: this test hands the model the source text and grades whether the summary stays faithful to it. It's closer to reading comprehension than to knowing facts about the world. The model isn't being asked "what's true," it's being asked "does this summary match the page in front of you." That's a real skill, and a useful one, but it isn't the thing people mean when they ask whether AI makes stuff up. Most models on that same leaderboard still land in the 7% to 13% range even on this friendly task.

The number that should worry you

Now the other direction. On OpenAI's own PersonQA factual benchmark, the newer reasoning models hallucinated more than the older one. The o1 model scored 16%. Its successor o3 jumped to 33%. And o4-mini hit 48%, wrong on roughly half its answers. This is OpenAI's own system card, not a critic's cherry-pick. It kills the assumption a lot of people carry around, that each new model is automatically more truthful than the last.

OpenAI didn't fully explain it. The system card says more research is needed and offers only a partial mechanism: o3 makes more claims overall, so it produces more accurate claims and more hallucinated ones. The trend did reverse later. Per the GPT-5 system card, GPT-5 with web search is about 45% less likely to make a factual error than GPT-4o, and GPT-5-thinking makes over 5x fewer factual errors than o3. So the trajectory is real and it's pointed the right way. It's just not a straight line, and "newer" was never a guarantee.

There is no single hallucination rate. There's a rate for a task, and anyone quoting one number without the task attached is selling you something.

Hallucination rate by setting

Put the two worlds side by side and the spread is the headline. The same families of models look almost solved on grounded summarization and shaky on open-ended recall. This is why benchr keeps pushing back on clean leaderboard numbers in its piece on why benchmarks stopped telling you much: the score is only as honest as the task behind it.

Hallucination rates depend on the task, not just the model, May 2026
Setting / model	Hallucination rate	What the test measures
Grounded summary: finix_s1_32b (Ant Group)	1.8%	Stays faithful to a document you provide
Grounded summary: gpt-5.4-nano (OpenAI)	3.1%	Stays faithful to a document you provide
Grounded summary: gemini-2.5-flash-lite (Google)	3.3%	Stays faithful to a document you provide
Grounded summary: older shorter-doc test (Gemini-2.0-Flash)	0.7%	Same skill, easier original dataset
Open-ended facts: o1 (PersonQA)	16%	Recalls facts with no source provided
Open-ended facts: o3 (PersonQA)	33%	Recalls facts with no source provided
Open-ended facts: o4-mini (PersonQA)	48%	Recalls facts with no source provided

Read the table as one point: give the model the answer to read and it rarely misquotes it. Ask it to pull the answer from memory and it's a different machine. The numbers aren't contradictory, they're measuring two jobs that happen to share a word.
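One way to keep a rate from drifting away from its task is to store the two together. The sketch below is illustrative only: the Rate class and spread_by_setting function are hypothetical names, and the figures are the ones in the table above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    setting: str    # the task the figure was measured on
    model: str
    percent: float  # hallucination rate, in percent


# Figures copied from the table above.
RATES = [
    Rate("grounded summary", "finix_s1_32b", 1.8),
    Rate("grounded summary", "gpt-5.4-nano", 3.1),
    Rate("grounded summary", "gemini-2.5-flash-lite", 3.3),
    Rate("grounded summary", "Gemini-2.0-Flash (older dataset)", 0.7),
    Rate("open-ended facts", "o1", 16.0),
    Rate("open-ended facts", "o3", 33.0),
    Rate("open-ended facts", "o4-mini", 48.0),
]


def spread_by_setting(rates):
    """Return {setting: (lowest, highest)} so no rate is reported without its task."""
    grouped = defaultdict(list)
    for rate in rates:
        grouped[rate.setting].append(rate.percent)
    return {setting: (min(vals), max(vals)) for setting, vals in grouped.items()}


if __name__ == "__main__":
    for setting, (low, high) in spread_by_setting(RATES).items():
        print(f"{setting}: {low}% to {high}%")
```

Grouped this way, the two ranges don't overlap at all, which is the table's point in one line of output.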

Why it isn't solved, in OpenAI's own words

The clearest tell that the field doesn't consider this done comes from OpenAI's September 2025 paper, "Why Language Models Hallucinate." Its argument is about incentives, not magic. Most benchmarks grade an answer as simply right or wrong, and "I don't know" scores a flat zero, the same as a wrong answer. Under that math, a model that guesses beats a model that admits it isn't sure. So models get trained to bluff with confidence, because bluffing is statistically the winning move on the test.

The fix the authors propose isn't a new hallucination filter. It's to change how mainstream evaluations are scored: give partial credit for a well-placed "I don't know," and actively penalize confident wrong answers under an explicit confidence threshold. When a lab proposes rescoring the mainstream benchmarks rather than shipping a patch, that's the field telling you the problem is structural. It's not a bug a single release closes.
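The incentive argument is easy to check with arithmetic. The sketch below is a simplification, not the paper's evaluation code: it assumes a correct answer scores 1, "I don't know" scores 0, and a wrong answer costs t / (1 - t) points for a stated confidence threshold t. That penalty is an assumption chosen for illustration, picked because it makes answering break even exactly at confidence t. The function names are hypothetical.

```python
ABSTAIN_SCORE = 0.0  # assumption: "I don't know" scores zero in both schemes


def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score for answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct - (1.0 - p_correct) * wrong_penalty


def penalty_for_threshold(t: float) -> float:
    """Penalty under which answering breaks even exactly at confidence t."""
    return t / (1.0 - t)


def best_move(p_correct: float, wrong_penalty: float) -> str:
    """Choose the action with the higher expected score."""
    if expected_score(p_correct, wrong_penalty) > ABSTAIN_SCORE:
        return "answer"
    return "abstain"


if __name__ == "__main__":
    binary = 0.0                            # right/wrong grading: no penalty
    strict = penalty_for_threshold(0.75)    # 3 points lost per wrong answer
    for p in (0.2, 0.9):
        print(p, best_move(p, binary), best_move(p, strict))
```

Under right/wrong grading, a model that is only 20% sure still gains by guessing, since any positive chance beats a zero. With the penalty in place, the same guess has a negative expected score and abstaining wins, while a 90% confident answer is still worth giving.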

What actually cuts made-up answers

Here's the practical part, and it's the same advice that's worked for two years now. You don't fix hallucinations by waiting for a smarter model. You fix them by not asking the model to work from memory in the first place.

The reliable move is grounding: retrieve the relevant source text, hand it to the model, and require it to answer from that text and cite it. This is retrieval-augmented generation, and it's exactly why the Vectara summarization scores look so much better than PersonQA. RAG and the summarization test rely on the same trick: the model is given the source. If you're weighing whether to retrieve sources or bake knowledge into the weights, benchr's take on RAG versus fine-tuning covers when each one earns its place. For most factual workloads, retrieval wins because it shows its work.

Three rules make grounding hold up. First, invest in retrieval quality, such as reranking and hybrid search, because the model can only be as accurate as the passages you feed it. Second, instruct it plainly: if the answer isn't in the provided sources, say "I don't know" instead of filling the gap. Third, add a verification pass that checks the output against the sources before anyone trusts it. Citations help here, since they let a reader audit the claim, but a citation next to a sentence doesn't prove the sentence is true. Skip the pass and you're trusting a footnote you never checked.
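The second and third rules can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production pipeline: the function names, the bracketed source-id format, and the exact abstention string are all hypothetical, retrieval itself is left out, and no model API is called. The verification here is deliberately cheap: it checks that every sentence cites a known source and that any number in the sentence appears in the cited passage.

```python
import re

ABSTAIN = "I don't know."
CITATION = re.compile(r"\[([^\]]+)\]")
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def build_grounded_prompt(question: str, passages: dict[str, str]) -> str:
    """Assemble a prompt that supplies sources and demands citation or abstention."""
    sources = "\n".join(f"[{pid}] {text}" for pid, text in passages.items())
    return (
        "Answer using only the sources below. Cite the source id in brackets "
        f"in every sentence. If the answer is not in the sources, reply exactly: {ABSTAIN}\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


def verify_answer(answer: str, passages: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the cheap checks passed."""
    if answer.strip() == ABSTAIN:
        return []
    problems = []
    # Split on whitespace that follows sentence-ending punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        cited = CITATION.findall(sentence)
        if not cited:
            problems.append(f"uncited claim: {sentence}")
            continue
        unknown = [pid for pid in cited if pid not in passages]
        if unknown:
            problems.append(f"unknown source id(s) {unknown}: {sentence}")
            continue
        cited_text = " ".join(passages[pid] for pid in cited)
        source_numbers = set(NUMBER.findall(cited_text))
        claim = CITATION.sub("", sentence)  # drop ids so their digits are ignored
        for number in NUMBER.findall(claim):
            if number not in source_numbers:
                problems.append(f"number {number} not in cited source: {sentence}")
    return problems
```

A check like this catches a missing citation or a swapped figure, but it can't tell whether a cited passage actually supports the sentence's meaning. That part still needs a stronger verifier or a human reader.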

One more practical note, since it cuts the other way. The same grounding that reduces made-up facts is also what gets your pages quoted by assistants, because retrieval is how those systems decide what to pull in. If that's a goal, benchr's guide to getting cited inside AI answers is the companion piece. Clear, well-sourced pages are easier for a model to ground on, which is good for accuracy and good for visibility at the same time.

So, fixed or not?

Not fixed. Better, in places, and worse than the hype in others. The progress is real: flagship models with web search make far fewer factual errors than their predecessors did a year earlier, and grounded summarization is close to a solved task. The gap is open-ended recall, where even fresh reasoning models can swing wildly, and where the field's own scoring still quietly rewards a confident guess. Buy on that basis. Ground your factual workloads, force abstention, verify the output, and never repeat a single "the hallucination rate is X%" figure without saying which task it came from.

Frequently asked

Are AI hallucinations fixed in 2026?

No. The best models stay grounded much better than they used to. On Vectara's document-summarization leaderboard, updated May 11, 2026, the top model hallucinates just 1.8% of the time. But that's a narrow summarize-this-text test. On open-ended factual questions, error rates run far higher, and OpenAI's own September 2025 paper argues the problem isn't solved because evaluations still reward confident guessing over admitting uncertainty.

Which AI model hallucinates the least right now?

On Vectara's current Hallucination Leaderboard, Ant Group's finix_s1_32b leads at 1.8%, then OpenAI's gpt-5.4-nano at 3.1% and Google's gemini-2.5-flash-lite at 3.3%. The catch: this only measures how faithfully a model summarizes a document you hand it, not how often it gets general facts right. On the older shorter-document version of the test, Gemini-2.0-Flash reached 0.7%.

Do reasoning models hallucinate more than older models?

Not always. In OpenAI's own o3 and o4-mini system card, the newer reasoning models hallucinated more on the PersonQA factual benchmark: o1 scored 16%, o3 jumped to 33%, and o4-mini hit 48%. OpenAI said more research was needed and offered only a partial explanation, that o3 makes more claims overall. Later, GPT-5 reversed the trend: GPT-5-thinking makes over 5x fewer factual errors than o3 per OpenAI's GPT-5 system card.

Does RAG stop hallucinations?

It cuts them, but it doesn't end them. Grounding the model in retrieved source text, then telling it to cite the documents and to say "I don't know" when the answer isn't there, keeps it on the provided context instead of guessing from memory. That's also why models score so much better on Vectara's summarization test than on open-ended fact questions: they're handed the source up front. Citations let users audit answers, but they don't guarantee the answer is right.

How do you reduce AI hallucinations?

Ground the model in retrieved source text, require it to cite that text, and instruct it to abstain when the answer isn't in the sources. Add a verification pass over the output before you trust it. Better retrieval, such as reranking and hybrid search, raises the quality of what the model reads, which is the lever that actually moves the error rate.

Changelog

May 30, 2026 — Originally published. Vectara leaderboard figures, PersonQA system-card rates, the GPT-5 relative-error claims, and the "Why Language Models Hallucinate" argument verified against the named sources below.

References

Vectara, "Hallucination Leaderboard," github.com, last updated May 11, 2026, accessed May 2026.
AIbase, "Vectara Hallucination Leaderboard (HHEM-2.1) original dataset," aibase.com, accessed May 2026.
OpenAI, "o3 and o4-mini System Card," April 16, 2025, cdn.openai.com, accessed May 2026.
Simon Willison, "OpenAI o3 and o4-mini System Card," simonwillison.net, accessed May 2026.
OpenAI, "GPT-5 System Card," August 13, 2025, cdn.openai.com, accessed May 2026.
A. Kalai, O. Nachum, S. Vempala, E. Zhang, "Why Language Models Hallucinate," openai.com, September 2025, accessed May 2026.
A. Kalai, O. Nachum, S. Vempala, E. Zhang, "Why Language Models Hallucinate," arXiv 2509.04664, arxiv.org, accessed May 2026.
Towards Data Science, "5 Techniques to Prevent Hallucinations in Your RAG Question Answering," towardsdatascience.com, accessed May 2026.