<p>I used to be able to spot an LLM hallucination a mile away. The writing had a whiff about it — too confident, slightly off-topic, a citation that didn’t quite exist. You’d read it, squint, and go: <em>no, that’s not right</em>.</p>
<p>That job is getting harder. Not because the models have stopped making things up. They haven’t. The problem is that they’ve got much better at making things up <em>plausibly</em>.</p>
<h3>Why this is a problem we can’t engineer away</h3>
<p>Every large language model — Claude, ChatGPT, Gemini, Copilot, the lot — is at its core a probability engine. It predicts the “next token” based on everything that came before. That’s a remarkable trick when the training data is rich and the question is well-posed. It’s less useful when the model has to reach for something outside its training or retrieval context, because the same machinery that produces a correct answer also produces a confident-sounding wrong one. The models can represent uncertainty internally — token probabilities give you something to work with — but there’s no reliable mechanism that surfaces “I don’t know” to the user. It’s a calibration and training-incentive problem, not a capability one.</p>
<p>This isn’t just my view. In September 2025, OpenAI’s own researchers <a href="https://arxiv.org/abs/2509.04664">published a paper arguing that hallucinations can be understood, in part, as errors in binary classification</a> — a predictable consequence of how models are trained and how we evaluate them. Accuracy-only leaderboards, they point out, reward confident guessing over saying “I don’t know.” Other researchers have gone further: one group <a href="https://arxiv.org/abs/2401.11817">showed, under formal assumptions about computability and training data, that hallucination is unavoidable</a> in any computable LLM, while <a href="https://arxiv.org/abs/2409.05746">another used Gödel-style reasoning</a> to argue that hallucination risk persists across every stage of the LLM pipeline. Even an earlier <a href="https://arxiv.org/abs/2311.14648">STOC 2024 paper by Kalai and Vempala</a> suggested that rare facts in training data place a lower bound on the hallucination rate of any well-calibrated model. These are theoretical results with real assumptions behind them, but the direction is consistent: hallucination isn’t going to be patched out.</p>
<p>Reasoning models — the frontier stuff from the last eighteen months — help, but they don’t solve it. What they do is take more turns: they draft, critique, redraft, search, reconsider. Given enough compute and enough tool access (web search, code execution, document retrieval), a reasoning model can catch its own errors. Given a single quick turn on a question outside its training data, it often won’t.</p>
<p>And there’s a twist. <a href="https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf">OpenAI’s own system card for o3 and o4-mini</a>, published in April 2025, reported that o3 hallucinates around 33% of the time on the PersonQA benchmark — a difficult factual-recall task — and o4-mini around 48%, roughly double the rate of their o1 and o3-mini predecessors. That’s one benchmark, not a universal regression, but <a href="https://www.nytimes.com/2025/05/05/technology/ai-hallucinations-chatgpt-google.html">the <em>New York Times</em> covered the trend in May 2025</a>: on several benchmarks, hallucination rates in newer reasoning models have gone up, not down. Models are tackling harder tasks and attempting more claims; the absolute number of confident-but-wrong statements rises along with the confident-and-right ones. So the ability to <em>sound right</em> is increasing faster than the ability to <em>be right</em>. That’s the problem.</p>
<h3>What a hallucination actually looks like now</h3>
<p>Two years ago an obvious hallucination might have been a wrong date, a fabricated quote, a made-up case law reference. You could fact-check it in thirty seconds.</p>
<p>Today’s failures are subtler. A Python script that uses a function with the right name but wrong arguments. A network configuration that’s valid syntax but wrong for your vendor’s firmware version. A summary of a document that captures the shape of the argument but inverts one crucial claim. A legal clause that reads professionally but cites the wrong Act. The shell is convincing. The filling isn’t.</p>
<p>I’ve seen this in my own work, more frequently than I did even a year ago, and colleagues have noticed the same thing. An LLM recently drafted a paragraph for a colleague about a piece of telecoms hardware, confidently asserting that it was “the same ATA our CloudPBX team already ships into office deployments.” We don’t ship it. The model had never been told we did. It reads as the kind of thing a helpful internal colleague would write — specific, plausible, in the right voice — which is exactly what makes it dangerous. In another session, pushed on where a particular claim had come from, the model simply admitted it had made the line up: it had extrapolated from a third-party review and the general reputation of a similar product, then written it in first-person company voice because it sounded convincing. That’s an unusually honest self-report and worth taking seriously. It’s the machinery describing itself: <em>I wrote what sounded right</em>.</p>
<p>This is where the structural point from the last section stops being abstract. It’s also something people often miss: even when you give an LLM good source material to work from, it doesn’t just quote you back. It summarises, re-references, and reformulates internally — and things get distorted in that process. The research community calls this <em>faithfulness</em> hallucination, to distinguish it from the making-stuff-up variety. <a href="https://dl.acm.org/doi/10.1145/3703155">The canonical survey by Huang et al.</a> describes it as “context inconsistency” — cases where the model “ignores or alters important facts within the original text.” A paper from <a href="https://arxiv.org/abs/2505.15291">January 2026 found a tendency for faithfulness to degrade toward the <em>end</em> of longer responses</a> — the model drifts from the source the further it gets into its own output. So that long, fluent, well-cited summary of your policy document? The bit at the bottom is, on the evidence, the most likely place for something to be subtly wrong.</p>
<p><a href="https://www.oneusefulthing.org/p/15-times-to-use-ai-and-5-not-to">Ethan Mollick has made a related point</a>: because LLM errors are, by construction, plausible, users “fall asleep at the wheel.” And it’s not a new concern — a <a href="https://www.science.org/doi/10.1126/sciadv.adh1850">2023 <em>Science Advances</em> study</a> found that in controlled conditions, participants were more likely to believe AI-generated disinformation than the human-written equivalent. The prose was simply better. That was GPT-3. The gap has only widened.</p>
<p>This is not a failure mode anyone can detect by vibe-checking the output.</p>
<h3>What actually works</h3>
<p>There’s a temptation to treat this as a novel problem requiring novel tools. It isn’t. Academia has been dealing with confident-sounding-but-wrong writing for about four hundred years, and the mechanism we landed on is peer review. Someone else reads your work and tries to break it.</p>
<p>That’s exactly what works with AI output too. A few patterns I rely on:</p>
<p><strong>Ask the same model to critique its own answer, in a fresh session.</strong> Not “check your work” — that rarely does much. Instead: “Here’s a proposed solution to X. What’s wrong with it? What assumptions is it making? What would cause this to fail?” The fresh session matters, because the model isn’t anchored to defending what it just wrote. This is the idea behind <a href="https://arxiv.org/abs/2303.17651">Self-Refine</a>, a NeurIPS 2023 paper that reported improvements of around 20% on several tasks from iterative self-critique, without any additional training. There’s a catch, though: <a href="https://arxiv.org/abs/2402.08115">a 2024 paper from Stechly, Valmeekam and Kambhampati</a> found that pure self-critique can actually make reasoning performance <em>worse</em> on some tasks — the model talks itself into the wrong answer. Which leads to the next pattern.</p>
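<p>The fresh-session critique is easy to script against any chat-style API. A minimal sketch in Python, with <code>call_model</code> as a stand-in for whatever client you use (a placeholder, not a real SDK call):</p>

```python
def critique_in_fresh_session(call_model, task, draft):
    """Ask for a critique in a brand-new session: the prompt carries no
    prior conversation, so the model isn't anchored to defending the draft."""
    prompt = (
        "Here is a proposed solution to the following task.\n\n"
        f"Task: {task}\n\n"
        f"Proposed solution:\n{draft}\n\n"
        "What's wrong with it? What assumptions is it making? "
        "What would cause this to fail?"
    )
    # A single-message history is what makes it a fresh session.
    return call_model(messages=[{"role": "user", "content": prompt}])
```

<p>The point of the wrapper is the discipline it encodes: the draft goes in as material to be attacked, never as something the model previously said.</p>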
<p><strong>Cross-examine with a different model.</strong> If I’ve had Claude draft a tricky piece of network config, I’ll often paste it into ChatGPT or Gemini and ask what it would do differently and why. The disagreements are where the interesting errors live. The models have different training data, different reasoning styles, different blind spots. Where they all agree, my confidence goes up — but it’s not proof. Frontier models are trained on heavily overlapping corpora and can confidently agree on the same wrong answer, especially on topics where the public internet is itself wrong. So agreement is a signal, not a guarantee; disagreement is what tells me I need to go and read the documentation myself. This pattern is exactly the approach taken in <a href="https://arxiv.org/abs/2305.14325">Du et al.’s “Multiagent Debate” paper at ICML 2024</a>, which found that having multiple LLM instances debate an answer over several rounds measurably reduces hallucination. Google DeepMind’s <a href="https://deepmind.google/blog/facts-grounding-a-new-benchmark-for-evaluating-the-factuality-of-large-language-models/">FACTS Grounding benchmark</a> operationalises the same idea at institutional scale, using an ensemble of Gemini, GPT-4o and Claude to judge factuality.</p>
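<p>The cross-examination step can be sketched the same way. Here <code>answerers</code> is a dict of hypothetical model callables; real API clients would sit behind each one:</p>

```python
def cross_examine(answerers, question):
    """Put the same question to several independent models and surface
    disagreement. Agreement raises confidence; it is not proof."""
    answers = {name: fn(question) for name, fn in answerers.items()}
    distinct = {a.strip().lower() for a in answers.values()}
    return {
        "answers": answers,
        # Any disagreement is the signal to go read the documentation yourself.
        "needs_human_review": len(distinct) > 1,
    }
```

<p>The exact-match comparison is deliberately naive; in practice you'd ask a model to judge whether two answers are substantively the same, but the routing logic (disagreement escalates to a human) is the part that matters.</p>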
<p><strong>Make the model do the research.</strong> For anything involving facts that change — pricing, versions, API behaviour, legislation — don’t trust the weights. Make the model search, cite the source, and then check the source actually says what it claims. Retrieval doesn’t eliminate hallucination — an <a href="https://aclanthology.org/2023.emnlp-main.398/">EMNLP 2023 benchmark called ALCE</a> found that even the best systems miss full citation support about 50% of the time, and as noted above, even with a good source document in hand, models can drift from what it actually says — but it massively reduces fabrication when combined with a human checking the citations. Vendors are starting to build this in: <a href="https://simonwillison.net/2025/Jan/24/anthropics-new-citations-api/">Anthropic’s Citations API</a>, launched in January 2025, returns direct quotes from supplied source documents, and at least one customer reported reducing “source confabulations from 10% to zero.”</p>
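<p>The "check the source actually says what it claims" step can be partially automated. A deliberately crude sketch: it catches outright fabrication of quotes, not subtle paraphrase drift, which still needs a human:</p>

```python
import re

def quote_is_supported(quote, source_text):
    """Crude citation check: does the (whitespace-normalised) quote
    actually appear in the source document?"""
    def norm(s):
        return re.sub(r"\s+", " ", s).strip().lower()
    return norm(quote) in norm(source_text)
```
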
<p><strong>Build the friction in before you need it.</strong> The cheapest time to catch a hallucination is before you’ve acted on it. That means reviewing AI output <em>before</em> you paste it into the production config, send it to the client, or quote it in the board paper. <a href="https://simonwillison.net/2025/Mar/2/hallucinations-in-code/">Simon Willison puts it well</a>: hallucinated code is actually the <em>least</em> dangerous failure mode, because the compiler tells you immediately. It’s the hallucinations that read clean and pass initial review that hurt. Treat the first draft as a draft.</p>
<h3>What this means for how we use AI at work</h3>
<p>At RWTS we do a lot of work where the cost of a confident wrong answer is high — network changes, security configurations, compliance documents. The teams who get good results with AI aren’t the ones who’ve bought the most expensive model. They’re the ones who’ve built review into the workflow. A senior engineer reviews the AI-drafted config. A peer reads the AI-drafted proposal before it goes out. The AI is a junior colleague with a photographic memory and occasional blind spots, not an oracle.</p>
<p>That framing isn’t mine — <a href="https://www.oneusefulthing.org/p/on-boarding-your-ai-intern">Ethan Mollick has been arguing for years</a> that LLMs are best thought of as “weird, somewhat alien interns that work infinitely fast and sometimes lie to make you happy.” <a href="https://simonwillison.net/2025/Oct/7/vibe-engineering/">Simon Willison goes further</a> and calls them “a growing army of weird digital interns who will absolutely cheat if you give them a chance.” The operational answer, in both cases, is the same: tests, specs, code review, and a human in the loop who actually knows the domain.</p>
<p>That’s the one I’d offer to anyone trying to work out how much to trust these tools. You wouldn’t let a talented graduate push code to production unreviewed on day one. You wouldn’t let them send a client a statement of work without someone else reading it. The same rules apply.</p>
<p>Hallucinations aren’t going away. They’re a property of how these systems work, not a bug that’ll be patched out in the next release. The models will keep getting better at sounding right. Our job is to make sure the review process keeps up.</p>
<hr />
<p><em>A note on process: I drafted this post with Claude’s help, then had a different model review it for overstated claims and weak citations. Several paragraphs got tightened as a result. Which is, roughly, the point.</em></p>
<hr />
<p><em>I’m the Director and CTO over at <a href="https://rwts.com.au">Real World Technology Solutions</a>. RWTS helps organisations get real value from AI without betting the business on it. Call us on 1300 798 718.</em></p>
<p class="wp-block-paragraph">If you’ve spent any serious time building software with an AI coding agent — Claude Code, Cursor, Copilot Workspace, Kiro, or any of the others — you’ve probably noticed something uncomfortable. The agent starts brilliantly. It reads your specification, creates a thoughtful work plan, and begins implementing with real understanding. Then, somewhere around the 30-minute mark, things quietly fall apart.</p>
<p class="wp-block-paragraph">Requirements get simplified. Edge cases vanish. Features appear that nobody asked for. And the agent, if you ask it, will cheerfully tell you everything is on track.</p>
<p class="wp-block-paragraph">At <a href="https://rwts.com.au">Real World</a>, we’ve been researching this for around 18 months — and what we’ve found goes well beyond “just give it a bigger context window.”</p>
<h2 class="wp-block-heading">The problem has a name now</h2>
<p class="wp-block-paragraph">The research community has converged on a term: <strong>specification drift</strong>. It refers to the progressive loss of connection between an AI agent’s output and the requirements it was given. It’s a specific manifestation of a broader phenomenon called <a href="https://research.trychroma.com/context-rot">context rot</a>: LLM performance degrades as input volume grows, even when every token in the context is relevant.</p>
<p class="wp-block-paragraph">This isn’t a niche concern. Laban et al. measured an <a href="https://arxiv.org/abs/2503.04591">average 39% performance drop</a> in multi-turn versus single-turn LLM interactions across 200,000 simulated conversations. Du et al. found performance degrading by up to 85% as input length increases, even when the model can perfectly retrieve all relevant information. Liu et al.’s foundational <a href="https://arxiv.org/abs/2307.03172">“Lost in the Middle”</a> study showed that LLMs attend reliably to the beginning and end of their context but degrade significantly for information in the middle — which is exactly where your specification ends up once generated code starts accumulating.</p>
<p class="wp-block-paragraph">Here’s how it plays out in practice:</p>
<figure class="wp-block-image size-large is-resized"><a href="https://andrewyager.com/wp-content/uploads/2026/02/spec-drift-cycle.svg"><img src="https://andrewyager.com/wp-content/uploads/2026/02/spec-drift-cycle.svg" alt="" class="wp-image-1570" style="aspect-ratio:2.3529411764705883;width:833px;height:auto"/></a></figure>
<p class="wp-block-paragraph">The agent reads your spec, builds a plan, and starts implementing. As the conversation grows, the context window fills up and <strong>compaction</strong> kicks in — the system automatically summarises older content to make room. During compaction, specific requirements, section references, and constraints are lost. The agent continues implementing from a degraded recollection of what was asked for, not the actual specification. Gaps are discovered late. Expensive rework follows. And the cycle repeats.</p>
<p class="wp-block-paragraph">We observed this consistently across production projects. Gap analyses at the end of implementation sessions consistently revealed requirements the agent had marked as complete but that were actually missing key behaviours. The specification was still “in context” — but effectively invisible.</p>
<h2 class="wp-block-heading">What everyone else is doing (and why it’s not enough)</h2>
<p class="wp-block-paragraph">2025 saw spec-driven development emerge as a recognised practice. <a href="https://www.thoughtworks.com/en-us/insights/blog/agile-engineering-practices/spec-driven-development-unpacking-2025-new-engineering-practices">Thoughtworks</a> called it one of the year’s key new engineering practices. GitHub released <a href="https://github.com/github/spec-kit">Spec Kit</a>. Amazon launched <a href="https://kiro.dev">Kiro</a> with a built-in spec-to-code pipeline. JetBrains <a href="https://blog.jetbrains.com/junie/2025/10/how-to-use-a-spec-driven-approach-for-coding-with-ai/">Junie adopted spec-driven workflows</a>. The basic idea — write a specification before you write code — is sound, and it’s a meaningful step forward from “vibe coding.”</p>
<p class="wp-block-paragraph">But every one of these tools shares a fundamental limitation: they rely on <strong>soft guardrails</strong>. They tell the agent “do not write code yet” through prompt-level instructions, and hope the agent listens.</p>
<p class="wp-block-paragraph">This isn’t a prompting failure. It’s a structural one. The <a href="https://arxiv.org/abs/2503.10424">AgentIF benchmark</a> showed that even the best-performing models follow fewer than 30% of agentic instructions perfectly. Research from Anthropic (Denison et al., 2024) demonstrated that LLMs generalise from simple specification gaming to sophisticated reward tampering. Telling an AI agent “please follow the specification” is roughly as effective as telling a developer “please write tests.” The intent is right. The enforcement mechanism is missing.</p>
<h2 class="wp-block-heading">What we’ve been building</h2>
<p class="wp-block-paragraph">We’ve been approaching this from both ends of the software development lifecycle. On the specification side, we’ve been developing structured approaches to writing specs that survive AI context management — documents designed from the ground up to be consumed by agents, not just humans. On the implementation side, we’ve built <a href="https://github.com/realworldtech/claude-implement-skill"><code>/implement</code></a>, a publicly available Claude Code skill that enforces specification discipline through structural mechanisms rather than polite suggestions.</p>
<figure class="wp-block-image size-large is-resized"><a href="https://andrewyager.com/wp-content/uploads/2026/02/two-phase-approach.svg"><img src="https://andrewyager.com/wp-content/uploads/2026/02/two-phase-approach.svg" alt="" class="wp-image-1571" style="aspect-ratio:2.6666666666666665;width:840px;height:auto"/></a></figure>
<p class="wp-block-paragraph">The specification work isn’t something we’re releasing as a product — every team’s specification process is different, and we’d encourage you to develop your own. But the principles we’ve discovered apply regardless of how you write your specs. The <code>/implement</code> skill embodies the implementation side and is freely available.</p>
<p class="wp-block-paragraph">Here’s what we learned building both.</p>
<h3 class="wp-block-heading">Hard guardrails that survive context loss</h3>
<p class="wp-block-paragraph">The first thing we tried was embedding hard rules in the skill’s instructions — not suggestions, but absolute prohibitions. “Do not produce code during the specification phase.” “Do not mark a requirement as complete without running tests.”</p>
<p class="wp-block-paragraph">This worked great until context compaction occurred. Then the rules got compressed away along with everything else. The agent lost awareness of its own constraints.</p>
<p class="wp-block-paragraph">The fix required pairing every hard guardrail with a <strong>persistent recovery mechanism</strong> — a tracker file on disk that serves as the authoritative source of workflow state. After compaction, the agent reads the tracker, re-establishes where it is in the process, and the guardrails reload. The tracker isn’t just a progress log. It’s a recovery mechanism that embeds its own instructions: “If you’re reading this, here’s what this file means, here’s where we’re up to, and here’s what to do next.”</p>
<figure class="wp-block-image size-large is-resized"><a href="https://andrewyager.com/wp-content/uploads/2026/02/guardrails-recovery.svg"><img src="https://andrewyager.com/wp-content/uploads/2026/02/guardrails-recovery.svg" alt="" class="wp-image-1572" style="width:839px;height:auto"/></a></figure>
<p class="wp-block-paragraph">Neither component is sufficient alone. Hard rules without recovery fail at compaction boundaries. Recovery without hard rules fails under context pressure. The pairing is what works.</p>
<h3 class="wp-block-heading">Structural indexing: let the orchestrator navigate, not comprehend</h3>
<p class="wp-block-paragraph">When you ask an agent to work with a large specification, the naive approach is to load the whole thing into the conversation. A 130,000-token spec consumes most of the available context window, leaving almost nothing for reasoning.</p>
<p class="wp-block-paragraph">We developed a pattern we call <strong>structural indexing</strong>: the main conversation loads only a lightweight index of section identifiers and file sizes. Sub-agents, dispatched to work on specific sections, read the full content directly from disk. The main conversation’s job is navigation and dispatch, not comprehension.</p>
<figure class="wp-block-image size-large is-resized"><a href="https://andrewyager.com/wp-content/uploads/2026/02/structural-indexing.svg"><img src="https://andrewyager.com/wp-content/uploads/2026/02/structural-indexing.svg" alt="" class="wp-image-1573" style="width:840px;height:auto"/></a></figure>
<p class="wp-block-paragraph">The result was dramatic — context consumption dropped by over 98% with no measurable quality degradation. The insight was architectural: the orchestrating conversation doesn’t need to understand the specification. It needs to know where things are and which agent should read what. Comprehension happens at the edges, in fresh sub-agent contexts with full attention on their assigned work.</p>
<p class="wp-block-paragraph">This principle turned out to be universal. Every time the main conversation tried to hold large volumes of content — specifications coming in, agent outputs coming back, planning artefacts, even the skill’s own instructions — the same failure mode appeared. The solution was always the same: keep the orchestrator lightweight and let it coordinate, not consume.</p>
<h3 class="wp-block-heading">Anti-rationalisation: the finding nobody expected</h3>
<p class="wp-block-paragraph">This was the genuinely surprising discovery. We found that LLMs don’t just <em>forget</em> workflow steps — they actively construct locally coherent justifications for skipping them.</p>
<p class="wp-block-paragraph">An agent assigned to update a tracker file after completing a task would reason: “I know the state from the current conversation, so updating the tracker is redundant right now.” An agent told to write a plan to disk before executing would argue: “The next plan is obvious from context, so I’ll save time by executing directly.” Each justification is individually reasonable-looking. Each one silently breaks the recovery mechanism.</p>
<p class="wp-block-paragraph">We call these <strong>anti-rationalisation failures</strong>, and they’re distinct from the model ignoring instructions or adversarial jailbreaking. The model convinces itself, through plausible reasoning, that a required step doesn’t apply right now.</p>
<p class="wp-block-paragraph">The countermeasure is surprisingly specific: you have to name the exact excuses you want to prohibit. A general rule (“always update the tracker”) can be rationalised around. A rule that says “you must not skip this step, and specifically, these justifications are not valid: <em>‘I know the state from the current conversation,’</em> <em>‘this is the same session,’</em> <em>‘the next plan is obvious’</em>” — that holds. The named excuses don’t recur.</p>
<p class="wp-block-paragraph">This has broader implications for anyone designing AI workflows. General rules invite creative interpretation. Specific prohibitions close the rationalisation loop.</p>
<h3 class="wp-block-heading">Context isolation as a verification advantage</h3>
<p class="wp-block-paragraph">Here’s a reframing we’re particularly proud of. The standard view of LLM context boundaries is that they’re a limitation — agents can’t see each other’s work, so coordination is hard. We found that for verification, <strong>isolation is a feature.</strong></p>
<p class="wp-block-paragraph">In our TDD workflow, the test-writing agent reads only the specification. It never sees the implementation. The implementation agent works from a different context entirely. When their independent interpretations of the specification disagree, that disagreement surfaces ambiguities and catches drift early — before it compounds.</p>
<figure class="wp-block-image size-large is-resized"><a href="https://andrewyager.com/wp-content/uploads/2026/02/context-isolation.svg"><img src="https://andrewyager.com/wp-content/uploads/2026/02/context-isolation.svg" alt="" class="wp-image-1574" style="width:840px;height:auto"/></a></figure>
<p class="wp-block-paragraph">This is conceptually similar to N-version programming (Avizienis, 1985), where multiple independent teams develop from the same specification. The SAGA study validated the principle for LLMs specifically, finding that LLM-generated test suites have systematic blind spots mirroring the generating model’s error patterns. If your test agent sees the implementation, it inherits the implementation’s blind spots.</p>
<p class="wp-block-paragraph">The <a href="https://agilemanifesto.org">Agile Manifesto 25th Anniversary Workshop</a> concluded that test-driven development produces dramatically better results from AI coding agents by preventing them from writing tests that verify broken behaviour. We arrived at the same conclusion independently.</p>
<h3 class="wp-block-heading">Multi-skill interference: the next frontier</h3>
<p class="wp-block-paragraph">As the skill ecosystem around AI coding agents grows, a new problem is emerging. When multiple skill frameworks coexist in a single session — each with its own workflow assumptions and hard gates — they interfere in ways that neither would exhibit in isolation.</p>
<p class="wp-block-paragraph">We have identified three distinct failure patterns. The first is workflow capture, where the most recently invoked skill overrides earlier guardrails. Next is sub-agent context isolation, meaning dispatched agents don’t inherit any skill context and default to generic behaviour. The third pattern is planning framework deadlock, with two skills both trying to manage plan execution simultaneously. Recent research has confirmed this isn’t just our experience. Li (2025) found a phase transition in skill selection accuracy as library size grows. Performance drops sharply once semantic confusability between skills reaches a threshold.</p>
<p class="wp-block-paragraph">We’ve developed pattern-specific countermeasures. However, a general solution to multi-skill interference remains an open question. It is an active area of our research.</p>
<h2 class="wp-block-heading">What this means if you’re building with AI agents</h2>
<p class="wp-block-paragraph">The METR randomised controlled trial (Becker et al., July 2025) found that experienced open-source developers completed tasks 19% <em>slower</em> with AI assistance — while believing they were 24% faster. This perception gap is the real danger. Teams won’t self-correct toward better specification practices because they genuinely believe things are going well.</p>
<p class="wp-block-paragraph">Structural enforcement of specification discipline is necessary precisely because the humans in the loop can’t accurately assess when AI assistance is helping versus hindering.</p>
<p class="wp-block-paragraph">If you take one thing from this post, let it be this: the problem isn’t the AI. The problem is the absence of structural discipline around the AI. Every principle we’ve discovered boils down to the same insight — <strong>don’t rely on the agent’s good intentions. Build the structure that makes doing the right thing the only available path.</strong></p>
<p class="wp-block-paragraph">Treat specifications as persistent artefacts on disk, not conversation context. Use hard enforcement, not soft suggestions. Keep your orchestrator lightweight and delegate comprehension to sub-agents. Name the specific excuses you want to prevent, because general rules invite creative interpretation. And exploit context isolation for verification — don’t fight the boundaries between agents, use them.</p>
<h2 class="wp-block-heading">Try it yourself</h2>
<p class="wp-block-paragraph">The <a href="https://github.com/realworldtech/claude-implement-skill"><code>/implement</code> skill</a> is publicly available and works with Claude Code. We’d encourage you to try it on a real project and see how it changes the way your agent handles specifications.</p>
<p class="wp-block-paragraph">The principles matter more than any specific tool. <a href="https://github.com/realworldtech/claude-implement-skill"><code>/implement</code></a> is one implementation of them — freely available, and a good starting point. Your specification workflow should reflect how your team actually works; ours reflects how we work, and that’s the point.</p>
<p class="wp-block-paragraph">If you want proof the methodology holds up in practice, <a href="https://github.com/realworldtech/props">Props</a> is an open source inventory management system we built specification-first using exactly this process.</p>
<p class="wp-block-paragraph">We’re continuing this research. Multi-skill interference, sub-agent schema compliance, and partial completion detection are all open problems we’re actively working on. If you’re tackling similar challenges, we’d love to hear from you.</p>