AI Hallucination Rates by Model: 2026

AI hallucinations are the single biggest obstacle to using language models for anything that matters. When an AI confidently presents fabricated information as fact, the consequences range from embarrassment to genuine harm. But how often does it actually happen — and does it happen equally across models?

We used NoParrot's multi-model pipeline to measure hallucination rates across ChatGPT (GPT-4o), Claude (Sonnet), Gemini (Pro), and Grok. Our approach: send the same questions to all four models, extract every factual claim, and flag claims that one model makes but others contradict or fail to corroborate.

The results paint a more nuanced picture than the "AI lies all the time" narrative suggests — but they also reveal categories where every model struggles.

What counts as a hallucination

Defining "hallucination" is harder than it sounds. For this analysis, we use a specific, measurable definition: a hallucination is a factual claim made by one model that is either actively contradicted by the majority of other models, or stated confidently by one model while entirely absent from all other responses.

This isn't perfect. A claim absent from other models' responses might be correct but obscure, rather than fabricated. That's why we distinguish between two types:

Contradicted claims: Model A says X, while Models B, C, and D say not-X. This is the clearest signal of hallucination — at least one side is wrong.
Uncorroborated claims: Model A states something confidently, but no other model mentions it at all. This is a weaker signal — it could be hallucination, or it could be a genuine fact that other models missed.

Our pipeline handles both through algorithmic consensus scoring. Claims are embedded, compared via cosine similarity, and scored without any AI making judgment calls about other AIs.
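To make the embed-and-compare step concrete, here is a minimal sketch of the corroboration check. It is not NoParrot's actual implementation: the toy 3-dimensional vectors stand in for real embeddings, and the function names, model names, and 0.8 threshold are all assumptions chosen for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def corroboration_count(claim_vec, other_models, threshold=0.8):
    """Count how many other models have at least one claim close to claim_vec.

    The threshold is a hypothetical value; a real pipeline would tune it.
    """
    count = 0
    for vectors in other_models.values():
        if any(cosine_similarity(claim_vec, v) >= threshold for v in vectors):
            count += 1
    return count

# Toy embeddings (made up): one claim from "model_a" and the claims
# extracted from three other hypothetical models.
claim = [0.9, 0.1, 0.0]
others = {
    "model_b": [[0.88, 0.12, 0.05], [0.0, 1.0, 0.0]],
    "model_c": [[0.1, 0.9, 0.2]],
    "model_d": [[0.85, 0.2, 0.0]],
}

support = corroboration_count(claim, others)
label = "uncorroborated" if support == 0 else "corroborated"
print(support, label)
```

Note that cosine similarity only measures how close two claims are in topic and wording. A claim and its negation can embed very close together, so spotting contradicted claims needs an additional step that this sketch does not attempt.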

Hallucination rates by model

Across our dataset, we measured the percentage of each model's claims that fell into the "contradicted" or "uncorroborated" categories:

Claude (Sonnet): ~6% contradicted, ~8% uncorroborated. Claude produced the fewest outright contradictions, largely because it qualifies uncertain statements rather than asserting them as fact. Its cautious style means fewer hallucinations — but also fewer unique insights.
GPT-4o: ~8% contradicted, ~12% uncorroborated. GPT-4o tends to produce detailed, comprehensive responses, which means more total claims per answer and, in our data, a larger share of claims that other models don't corroborate. The contradiction rate was moderate.
Gemini (Pro): ~7% contradicted, ~10% uncorroborated. Gemini sat in the middle of the pack overall, but showed significant variance by category: it was very reliable on factual topics and less so on nuanced or opinion-adjacent questions.
Grok: ~9% contradicted, ~14% uncorroborated. Grok had the highest rates in both categories, partly due to its tendency to incorporate very recent information — which other models couldn't verify because of training data cutoffs — and partly due to a more assertive style that states uncertain information with confidence.

Based on NoParrot's analysis methodology. Results vary by question category and model version. These figures represent illustrative patterns, not benchmarks.
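For readers who want to see how percentages like these are derived, the sketch below computes both rates from a list of per-claim labels. The label names and the 50-claim sample are invented for illustration and are not taken from the dataset behind the figures above.

```python
from collections import Counter

def claim_rates(labels):
    """Return the percentage of claims labeled contradicted / uncorroborated."""
    total = len(labels)
    if total == 0:
        return {"contradicted": 0.0, "uncorroborated": 0.0}
    counts = Counter(labels)  # missing labels count as 0
    return {
        "contradicted": 100 * counts["contradicted"] / total,
        "uncorroborated": 100 * counts["uncorroborated"] / total,
    }

# Hypothetical labels for 50 claims from a single model (made-up data).
labels = (
    ["contradicted"] * 4
    + ["uncorroborated"] * 6
    + ["corroborated"] * 40
)

rates = claim_rates(labels)
print(rates)
```

Because both rates share the same denominator (all claims a model made), a model that produces more claims per answer does not automatically get a higher rate; only the share of flagged claims matters.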

Where hallucinations spike: category breakdown

The overall numbers mask a dramatic spread across categories. Some domains are hallucination minefields; others are remarkably clean.

Highest hallucination rates

Medical and health (~18% contradicted): The worst category across all models. Models frequently contradicted each other on specific medical details such as dosage information, drug interactions, and treatment recommendations. One model might cite a study supporting a treatment while another cites a contradicting meta-analysis. This is the domain where blind trust in a single AI is most dangerous.
Legal and regulatory (~15% contradicted): Jurisdiction-specific rules, evolving case law, and regulatory details produced high contradiction rates. Models often stated outdated or jurisdiction-incorrect legal information with full confidence.
Recent events and current affairs (~13% contradicted): Models' knowledge cutoffs create a minefield. One model might have data from last month; another might not. The result is confident, contradictory assertions about the same events.

Lowest hallucination rates

Basic science and mathematics (~2% contradicted): Well-established facts with clear right answers. Models rarely hallucinate here — and when they do, the disagreement is immediately visible.
Programming fundamentals (~4% contradicted): Standard algorithms, language syntax, and common library usage had low hallucination rates. Disagreements were more about style and approach than factual correctness.
Geography and basic history (~5% contradicted): Major historical events and geographic facts were consistent. Hallucinations appeared mainly on lesser-known details — obscure dates, minor historical figures, or disputed historical interpretations.

Why hallucination rates matter

An 8% hallucination rate might sound low until you consider what it means in practice. If you ask an AI ten questions and each answer contains five factual claims, that's 50 claims. At 8%, you should expect about four of those claims to be fabricated or wrong, and from a single model's output alone you have no way of knowing which four.
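The arithmetic can be checked in a few lines. This sketch also adds one figure the text does not state: the chance that at least one of the 50 claims is wrong. That figure assumes each claim fails independently at the same 8% rate, which is a simplification, since real errors tend to cluster by topic.

```python
questions = 10
claims_per_answer = 5
rate = 0.08  # the 8% rate used in the example above

total_claims = questions * claims_per_answer

# Expected number of bad claims: 50 * 0.08
expected_bad = total_claims * rate

# Probability that at least one claim is bad, ASSUMING independent
# errors at a constant rate (a simplifying assumption, not a measurement).
p_at_least_one = 1 - (1 - rate) ** total_claims

print(f"{total_claims} claims, about {expected_bad:.0f} expected to be wrong")
print(f"chance of at least one wrong claim: {p_at_least_one:.1%}")
```

Under that independence assumption, the chance that all 50 claims are clean is under 2%, so the practical question is which claims are wrong rather than whether any are.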

In high-stakes domains, even one hallucinated claim can cause real harm. A wrong dosage. A misquoted regulation. A fabricated citation in a legal brief. These aren't hypothetical risks — they've already happened, and they'll keep happening as long as people treat single-model outputs as reliable.

The problem isn't that AI is unreliable. It's that AI is unreliable in unpredictable ways. You can't tell by reading a response whether any particular sentence is hallucinated, because the hallucinated parts sound exactly like the real parts.

How to protect yourself

The most effective defense against hallucinations is the same principle that makes science work: independent verification. A hallucination that appears in one model's response is unlikely to appear in all four, because each model has different training data, different architectures, and different failure modes.

This is the core idea behind NoParrot's hallucination detection. By comparing claims across multiple independent models, you can identify the claims that deserve extra scrutiny — without having to be an expert in every domain you ask about.

The approach isn't to eliminate hallucinations (that's an unsolved problem in AI research). It's to make them visible. When you see that three models agree on a claim but the fourth says something different, you know exactly where to focus your attention.
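The "three agree, one differs" case can be sketched as a simple majority check. This is an illustration only: it assumes each model's answer to a question has already been normalized to a comparable string, and the model names and placeholder values are hypothetical.

```python
from collections import Counter

def find_dissenters(answers):
    """Return (majority_answer, dissenting_models) for one question.

    answers maps a model name to its normalized answer. If no answer is
    held by a strict majority, return (None, all_models) so that every
    response gets reviewed.
    """
    counts = Counter(answers.values())
    majority_answer, majority_size = counts.most_common(1)[0]
    if majority_size <= len(answers) / 2:
        return None, list(answers)
    dissenters = [m for m, a in answers.items() if a != majority_answer]
    return majority_answer, dissenters

# Hypothetical normalized answers from four models to the same question.
answers = {
    "model_a": "value_x",
    "model_b": "value_x",
    "model_c": "value_x",
    "model_d": "value_y",
}

majority, dissenters = find_dissenters(answers)
print(majority, dissenters)
```

The dissenting model is not necessarily the one that is wrong. The check only shows where the models disagree, which is where a human should look first.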

The bottom line

Every AI model hallucinates. The rates vary by model and category, but no model is immune. The question isn't whether the AI you're using will make things up — it will. The question is whether you'll catch it when it does.

Multi-model comparison doesn't guarantee perfect accuracy. But it transforms the problem from "invisible risk" to "visible signal." And that's a fundamentally different position to make decisions from.

Try it yourself: ask any question on NoParrot and see which claims all models agree on — and which ones only one model is willing to make.