<p>I used to be able to spot an LLM hallucination a mile away. The writing had a whiff about it — too confident, slightly off-topic, a citation that didn’t quite exist. You’d read it, squint, and go: <em>no, that’s not right</em>.</p>
<p>That job is getting harder. Not because the models have stopped making things up. They haven’t. The problem is that they’ve got much better at making things up <em>plausibly</em>.</p>
<h3>Why this is a problem we can’t engineer away</h3>
<p>Every large language model — Claude, ChatGPT, Gemini, Copilot, the lot — is at its core a probability engine. It predicts the “next token” based on everything that came before. That’s a remarkable trick when the training data is rich and the question is well-posed. It’s less useful when the model has to reach for something outside its training or retrieval context, because the same machinery that produces a correct answer also produces a confident-sounding wrong one. The models can represent uncertainty internally — token probabilities give you something to work with — but there’s no reliable mechanism that surfaces “I don’t know” to the user. It’s a calibration and training-incentive problem, not a capability one.</p>
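<p>The gap between internal uncertainty and outward confidence is easy to see in miniature. The sketch below is plain Python with made-up logit values, not any real model's numbers: two next-token distributions that would both emit the same word, one where the model is genuinely sure and one where it is close to guessing. The generated text is identical either way; the difference lives in probabilities the user never sees.</p>

```python
import math

def softmax(logits):
    """Convert raw model scores (logits) into a probability distribution."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: v / total for tok, v in exps.items()}

# Toy logits for the next token after "The capital of Australia is".
# A confident model concentrates its mass on one candidate...
confident = softmax({"Canberra": 9.0, "Sydney": 4.0, "Melbourne": 2.0})

# ...an uncertain one spreads it almost evenly.
uncertain = softmax({"Canberra": 5.1, "Sydney": 5.0, "Melbourne": 4.9})

def top_prob(dist):
    """Probability the model assigns to the token it will actually emit."""
    return max(dist.values())

# Both distributions pick "Canberra", so both outputs read identically.
# Only top_prob reveals that one answer was a near coin-flip.
```

In both cases the sampled token is "Canberra"; the near-guess just never announces itself as one.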
<p>This isn’t just my view. In September 2025, OpenAI’s own researchers <a href="https://arxiv.org/abs/2509.04664">published a paper arguing that hallucinations can be understood, in part, as errors in binary classification</a> — a predictable consequence of how models are trained and how we evaluate them. Accuracy-only leaderboards, they point out, reward confident guessing over saying “I don’t know.” Other researchers have gone further: one group <a href="https://arxiv.org/abs/2401.11817">showed, under formal assumptions about computability and training data, that hallucination is unavoidable</a> in any computable LLM, while <a href="https://arxiv.org/abs/2409.05746">another used Gödel-style reasoning</a> to argue that hallucination risk persists across every stage of the LLM pipeline. Even an earlier <a href="https://arxiv.org/abs/2311.14648">STOC 2024 paper by Kalai and Vempala</a> suggested that rare facts in training data place a lower bound on the hallucination rate of any well-calibrated model. These are theoretical results with real assumptions behind them, but the direction is consistent: hallucination isn’t going to be patched out.</p>
<p>Reasoning models — the frontier stuff from the last eighteen months — help, but they don’t solve it. What they do is take more turns: they draft, critique, redraft, search, reconsider. Given enough compute and enough tool access (web search, code execution, document retrieval), a reasoning model can catch its own errors. Given a single quick turn on a question outside its training data, it often won’t.</p>
<p>And there’s a twist. <a href="https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf">OpenAI’s own system card for o3 and o4-mini</a>, published in April 2025, reported that o3 hallucinates around 33% of the time on the PersonQA benchmark — a difficult factual-recall task — and o4-mini around 48%, roughly double the rate of their o1 and o3-mini predecessors. That’s one benchmark, not a universal regression, but <a href="https://www.nytimes.com/2025/05/05/technology/ai-hallucinations-chatgpt-google.html">the <em>New York Times</em> covered the trend in May 2025</a>: on several benchmarks, hallucination rates in newer reasoning models have gone up, not down. Models are tackling harder tasks and attempting more claims; the absolute number of confident-but-wrong statements rises along with the confident-and-right ones. So the ability to <em>sound right</em> is increasing faster than the ability to <em>be right</em>. That’s the problem.</p>
<h3>What a hallucination actually looks like now</h3>
<p>Two years ago an obvious hallucination might have been a wrong date, a fabricated quote, a made-up case law reference. You could fact-check it in thirty seconds.</p>
<p>Today’s failures are subtler. A Python script that uses a function with the right name but wrong arguments. A network configuration that’s valid syntax but wrong for your vendor’s firmware version. A summary of a document that captures the shape of the argument but inverts one crucial claim. A legal clause that reads professionally but cites the wrong Act. The shell is convincing. The filling isn’t.</p>
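<p>The “right name but wrong arguments” failure is worth seeing concretely. Here’s a small example using Python’s real <code>re.sub</code>, whose actual signature is <code>re.sub(pattern, repl, string)</code>; the hallucinated version simply swaps the last two arguments. Both lines run without complaint, and both return a string.</p>

```python
import re

text = "error: disk full; error: cpu hot"

# Correct call order: re.sub(pattern, repl, string)
right = re.sub(r"error", "warning", text)

# A plausible hallucination: same function, same argument count,
# repl and string swapped. No exception is raised -- the pattern is
# simply searched for in "warning", finds nothing, and the wrong
# string comes back untouched.
wrong = re.sub(r"error", text, "warning")

print(right)  # warning: disk full; warning: cpu hot
print(wrong)  # warning
```

A type checker won’t catch this (every argument is a string), a test suite might, and a tired reviewer skimming the diff probably won’t.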
<p>I’ve seen this in my own work, more frequently than I did even a year ago, and colleagues have noticed the same thing. An LLM recently drafted a paragraph for a colleague about a piece of telecoms hardware, confidently asserting that it was “the same ATA our CloudPBX team already ships into office deployments.” We don’t ship it. The model had never been told we did. It reads as the kind of thing a helpful internal colleague would write — specific, plausible, in the right voice — which is exactly what makes it dangerous. In another session, pushed on where a particular claim had come from, the model simply admitted it had made the line up: it had extrapolated from a third-party review and the general reputation of a similar product, then written it in first-person company voice because it sounded convincing. That’s an unusually honest self-report and worth taking seriously. It’s the machinery describing itself: <em>I wrote what sounded right</em>.</p>
<p>This is where the structural point from the last section stops being abstract. It’s also something people often miss: even when you give an LLM good source material to work from, it doesn’t just quote you back. It summarises, re-references, and reformulates internally — and things get distorted in that process. The research community calls this <em>faithfulness</em> hallucination, to distinguish it from the making-stuff-up variety. <a href="https://dl.acm.org/doi/10.1145/3703155">The canonical survey by Huang et al.</a> describes it as “context inconsistency” — cases where the model “ignores or alters important facts within the original text.” A <a href="https://arxiv.org/abs/2505.15291">2025 paper found a tendency for faithfulness to degrade toward the <em>end</em> of longer responses</a> — the model drifts from the source the further it gets into its own output. So that long, fluent, well-cited summary of your policy document? The bit at the bottom is, on the evidence, the most likely place for something to be subtly wrong.</p>
<p><a href="https://www.oneusefulthing.org/p/15-times-to-use-ai-and-5-not-to">Ethan Mollick has made a related point</a>: because LLM errors are, by construction, plausible, users “fall asleep at the wheel.” And it’s not a new concern — a <a href="https://www.science.org/doi/10.1126/sciadv.adh1850">2023 <em>Science Advances</em> study</a> found that in controlled conditions, participants were more likely to believe AI-generated disinformation than the human-written equivalent. The prose was simply better. That was GPT-3. The gap has only widened.</p>
<p>This is not a failure mode anyone can detect by vibe-checking the output.</p>
<h3>What actually works</h3>
<p>There’s a temptation to treat this as a novel problem requiring novel tools. It isn’t. Academia has been dealing with confident-sounding-but-wrong writing for about four hundred years, and the mechanism we landed on is peer review. Someone else reads your work and tries to break it.</p>
<p>That’s exactly what works with AI output too. A few patterns I rely on:</p>
<p><strong>Ask the same model to critique its own answer, in a fresh session.</strong> Not “check your work” — that rarely does much. Instead: “Here’s a proposed solution to X. What’s wrong with it? What assumptions is it making? What would cause this to fail?” The fresh session matters, because the model isn’t anchored to defending what it just wrote. This is the idea behind <a href="https://arxiv.org/abs/2303.17651">Self-Refine</a>, a NeurIPS 2023 paper that reported improvements of around 20% on several tasks from iterative self-critique, without any additional training. There’s a catch, though: <a href="https://arxiv.org/abs/2402.08115">a 2024 paper from Stechly, Valmeekam and Kambhampati</a> found that pure self-critique can actually make reasoning performance <em>worse</em> on some tasks — the model talks itself into the wrong answer. Which leads to the next pattern.</p>
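<p>In practice this pattern is small enough to wrap in a helper. The sketch below is illustrative, not any vendor’s API: <code>ask_model</code> is a hypothetical stand-in for whatever single-turn call your provider exposes, and the important property is that each review call carries no conversation history from the drafting session.</p>

```python
CRITIQUE_TEMPLATE = """Here is a proposed solution to the following problem.

Problem: {problem}

Proposed solution:
{draft}

What's wrong with it? What assumptions is it making?
What would cause this to fail? List concrete problems only; do not rewrite it."""


def critique_prompt(problem: str, draft: str) -> str:
    """Build the adversarial prompt. Note it never says 'check your
    work' -- it frames the draft as someone else's proposal."""
    return CRITIQUE_TEMPLATE.format(problem=problem, draft=draft)


def fresh_review(problem: str, draft: str, ask_model) -> str:
    """`ask_model` is a hypothetical single-turn model call. Because it
    receives only this prompt and no prior messages, the session is
    'fresh': the reviewer has nothing of its own to defend."""
    return ask_model(critique_prompt(problem, draft))
```

The template matters less than the structure: the reviewing call sees the draft as foreign text, which is the whole trick.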
<p><strong>Cross-examine with a different model.</strong> If I’ve had Claude draft a tricky piece of network config, I’ll often paste it into ChatGPT or Gemini and ask what it would do differently and why. The disagreements are where the interesting errors live. The models have different training data, different reasoning styles, different blind spots. Where they all agree, my confidence goes up — but it’s not proof. Frontier models are trained on heavily overlapping corpora and can confidently agree on the same wrong answer, especially on topics where the public internet is itself wrong. So agreement is a signal, not a guarantee; disagreement is what tells me I need to go and read the documentation myself. This pattern is exactly the approach taken in <a href="https://arxiv.org/abs/2305.14325">Du et al.’s “Multiagent Debate” paper at ICML 2024</a>, which found that having multiple LLM instances debate an answer over several rounds measurably reduces hallucination. Google DeepMind’s <a href="https://deepmind.google/blog/facts-grounding-a-new-benchmark-for-evaluating-the-factuality-of-large-language-models/">FACTS Grounding benchmark</a> operationalises the same idea at institutional scale, using an ensemble of Gemini, GPT-4o and Claude to judge factuality.</p>
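<p>The mechanics of the cross-examination are simple enough to sketch. Below, <code>models</code> maps a name to any callable that answers a question — the interface is hypothetical, and real client SDKs would slot in behind each callable. Unanimous agreement raises confidence; anything less is flagged for a human to settle against the primary sources.</p>

```python
from collections import Counter

def cross_examine(question, models):
    """Ask several independent models the same question and report
    whether they agree. Agreement is a signal, not a guarantee."""
    answers = {name: ask(question).strip().lower()
               for name, ask in models.items()}
    counts = Counter(answers.values())
    top_answer, votes = counts.most_common(1)[0]
    return {
        "answers": answers,
        # None means disagreement: go read the documentation yourself.
        "consensus": top_answer if votes == len(answers) else None,
    }
```

The normalisation here is deliberately crude (strip and lowercase); for config or prose you would compare semantically, usually by asking yet another model whether two answers conflict.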
<p><strong>Make the model do the research.</strong> For anything involving facts that change — pricing, versions, API behaviour, legislation — don’t trust the weights. Make the model search, cite the source, and then check the source actually says what it claims. Retrieval doesn’t eliminate hallucination — an <a href="https://aclanthology.org/2023.emnlp-main.398/">EMNLP 2023 benchmark called ALCE</a> found that even the best systems miss full citation support about 50% of the time, and as noted above, even with a good source document in hand, models can drift from what it actually says — but it massively reduces fabrication when combined with a human checking the citations. Vendors are starting to build this in: <a href="https://simonwillison.net/2025/Jan/24/anthropics-new-citations-api/">Anthropic’s Citations API</a>, launched in January 2025, returns direct quotes from supplied source documents, and at least one customer reported reducing “source confabulations from 10% to zero.”</p>
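<p>The “check the source actually says what it claims” step is the one worth automating first. A minimal sketch: given the source text and the quotes the model attributed to it, verify each quote actually appears, modulo whitespace and case. Paraphrases will fail the check too — which is fine, because paraphrase is exactly where drift hides, and a failed check just means a human reads that passage.</p>

```python
import re

def normalise(text: str) -> str:
    """Collapse whitespace and case so trivial formatting differences
    don't mask a genuine verbatim quote."""
    return re.sub(r"\s+", " ", text).strip().lower()

def verify_quotes(source: str, quotes: list[str]) -> dict[str, bool]:
    """Map each claimed quote to whether it appears verbatim
    (after normalisation) in the source document."""
    src = normalise(source)
    return {q: normalise(q) in src for q in quotes}
```

This is a naive substring check, not a substitute for services that return grounded spans — but even this catches the quote that was never in the document at all.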
<p><strong>Build the friction in before you need it.</strong> The cheapest time to catch a hallucination is before you’ve acted on it. That means reviewing AI output <em>before</em> you paste it into the production config, send it to the client, or quote it in the board paper. <a href="https://simonwillison.net/2025/Mar/2/hallucinations-in-code/">Simon Willison puts it well</a>: hallucinated code is actually the <em>least</em> dangerous failure mode, because the compiler tells you immediately. It’s the hallucinations that read clean and pass initial review that hurt. Treat the first draft as a draft.</p>
<h3>What this means for how we use AI at work</h3>
<p>At RWTS we do a lot of work where the cost of a confident wrong answer is high — network changes, security configurations, compliance documents. The teams who get good results with AI aren’t the ones who’ve bought the most expensive model. They’re the ones who’ve built review into the workflow. A senior engineer reviews the AI-drafted config. A peer reads the AI-drafted proposal before it goes out. The AI is a junior colleague with a photographic memory and occasional blind spots, not an oracle.</p>
<p>That framing isn’t mine — <a href="https://www.oneusefulthing.org/p/on-boarding-your-ai-intern">Ethan Mollick has been arguing for years</a> that LLMs are best thought of as “weird, somewhat alien interns that work infinitely fast and sometimes lie to make you happy.” <a href="https://simonwillison.net/2025/Oct/7/vibe-engineering/">Simon Willison goes further</a> and calls them “a growing army of weird digital interns who will absolutely cheat if you give them a chance.” The operational answer, in both cases, is the same: tests, specs, code review, and a human in the loop who actually knows the domain.</p>
<p>That’s the test I’d offer to anyone trying to work out how much to trust these tools. You wouldn’t let a talented graduate push code to production unreviewed on day one. You wouldn’t let them send a client a statement of work without someone else reading it. The same rules apply.</p>
<p>Hallucinations aren’t going away. They’re a property of how these systems work, not a bug that’ll be patched out in the next release. The models will keep getting better at sounding right. Our job is to make sure the review process keeps up.</p>
<hr />
<p><em>A note on process: I drafted this post with Claude’s help, then had a different model review it for overstated claims and weak citations. Several paragraphs got tightened as a result. Which is, roughly, the point.</em></p>
<hr />
<p><em>I’m the Director and CTO over at <a href="https://rwts.com.au">Real World Technology Solutions</a>. RWTS helps organisations get real value from AI without betting the business on it. Call us on 1300 798 718.</em></p>