Why Your AI Coding Agent Forgets What It’s Building (And What We’re Doing About It)
<p class="wp-block-paragraph">If you’ve spent any serious time building software with an AI coding agent — Claude Code, Cursor, Copilot Workspace, Kiro, or any of the others — you’ve probably noticed something uncomfortable. The agent starts brilliantly. It reads your specification, creates a thoughtful work plan, and begins implementing with real understanding. Then, somewhere around the 30-minute mark, things quietly fall apart.</p>
<p class="wp-block-paragraph">Requirements get simplified. Edge cases vanish. Features appear that nobody asked for. And the agent, if you ask it, will cheerfully tell you everything is on track.</p>
<p class="wp-block-paragraph">At <a href="https://rwts.com.au">Real World</a>, we’ve been researching this for around 18 months — and what we’ve found goes well beyond “just give it a bigger context window.”</p>
<h2 class="wp-block-heading">The problem has a name now</h2>
<p class="wp-block-paragraph">The research community has converged on a term: <strong>specification drift</strong>, the progressive loss of connection between an AI agent’s output and the requirements it was given. It’s a specific manifestation of a broader phenomenon called <a href="https://research.trychroma.com/context-rot">context rot</a>: LLM performance degrades as input volume grows, even when every token in the context is relevant.</p>
<p class="wp-block-paragraph">This isn’t a niche concern. Laban et al. measured an <a href="https://arxiv.org/abs/2503.04591">average 39% performance drop</a> in multi-turn versus single-turn LLM interactions across 200,000 simulated conversations. Du et al. found performance degrades by up to 85% as input length increases, even when the model can perfectly retrieve all relevant information. Liu et al.’s foundational <a href="https://arxiv.org/abs/2307.03172">“Lost in the Middle”</a> study showed that LLMs attend reliably to the beginning and end of their context but degrade significantly for information in the middle, which is exactly where your specification ends up once generated code starts accumulating.</p>
<p class="wp-block-paragraph">Here’s how it plays out in practice:</p>
<figure class="wp-block-image size-large is-resized"><a href="https://andrewyager.com/wp-content/uploads/2026/02/spec-drift-cycle.svg"><img src="https://andrewyager.com/wp-content/uploads/2026/02/spec-drift-cycle.svg" alt="" class="wp-image-1570" style="aspect-ratio:2.3529411764705883;width:833px;height:auto"/></a></figure>
<p class="wp-block-paragraph">The agent reads your spec, builds a plan, and starts implementing. As the conversation grows, the context window fills up and <strong>compaction</strong> kicks in — the system automatically summarises older content to make room. During compaction, specific requirements, section references, and constraints are lost. The agent continues implementing from a degraded recollection of what was asked for, not the actual specification. Gaps are discovered late. Expensive rework follows. And the cycle repeats.</p>
<p class="wp-block-paragraph">We observed this consistently across production projects. Gap analyses at the end of implementation sessions revealed requirements the agent had marked as complete but that were actually missing key behaviours. The specification was still “in context” — but effectively invisible.</p>
<h2 class="wp-block-heading">What everyone else is doing (and why it’s not enough)</h2>
<p class="wp-block-paragraph">2025 saw spec-driven development emerge as a recognised practice. <a href="https://www.thoughtworks.com/en-us/insights/blog/agile-engineering-practices/spec-driven-development-unpacking-2025-new-engineering-practices">Thoughtworks</a> called it one of the year’s key new engineering practices. GitHub released <a href="https://github.com/github/spec-kit">Spec Kit</a>. Amazon launched <a href="https://kiro.dev">Kiro</a> with a built-in spec-to-code pipeline. JetBrains <a href="https://blog.jetbrains.com/junie/2025/10/how-to-use-a-spec-driven-approach-for-coding-with-ai/">Junie adopted spec-driven workflows</a>. The basic idea — write a specification before you write code — is sound, and it’s a meaningful step forward from “vibe coding.”</p>
<p class="wp-block-paragraph">But every one of these tools shares a fundamental limitation: they rely on <strong>soft guardrails</strong>. They tell the agent “do not write code yet” through prompt-level instructions, and hope the agent listens.</p>
<p class="wp-block-paragraph">This isn’t a prompting failure. It’s a structural one. The <a href="https://arxiv.org/abs/2503.10424">AgentIF benchmark</a> showed that even the best-performing models follow fewer than 30% of agentic instructions perfectly. Research from Anthropic (Denison et al., 2024) demonstrated that LLMs generalise from simple specification gaming to sophisticated reward tampering. Telling an AI agent “please follow the specification” is roughly as effective as telling a developer “please write tests.” The intent is right. The enforcement mechanism is missing.</p>
<h2 class="wp-block-heading">What we’ve been building</h2>
<p class="wp-block-paragraph">We’ve been approaching this from both ends of the software development lifecycle. On the specification side, we’ve been developing structured approaches to writing specs that survive AI context management — documents designed from the ground up to be consumed by agents, not just humans. On the implementation side, we’ve built <a href="https://github.com/realworldtech/claude-implement-skill"><code>/implement</code></a>, a publicly available Claude Code skill that enforces specification discipline through structural mechanisms rather than polite suggestions.</p>
<figure class="wp-block-image size-large is-resized"><a href="https://andrewyager.com/wp-content/uploads/2026/02/two-phase-approach.svg"><img src="https://andrewyager.com/wp-content/uploads/2026/02/two-phase-approach.svg" alt="" class="wp-image-1571" style="aspect-ratio:2.6666666666666665;width:840px;height:auto"/></a></figure>
<p class="wp-block-paragraph">The specification work isn’t something we’re releasing as a product — every team’s specification process is different, and we’d encourage you to develop your own. But the principles we’ve discovered apply regardless of how you write your specs. The <code>/implement</code> skill embodies the implementation side and is freely available.</p>
<p class="wp-block-paragraph">Here’s what we learned building both.</p>
<h3 class="wp-block-heading">Hard guardrails that survive context loss</h3>
<p class="wp-block-paragraph">The first thing we tried was embedding hard rules in the skill’s instructions — not suggestions, but absolute prohibitions. “Do not produce code during the specification phase.” “Do not mark a requirement as complete without running tests.”</p>
<p class="wp-block-paragraph">This worked great until context compaction occurred. Then the rules got compressed away along with everything else. The agent lost awareness of its own constraints.</p>
<p class="wp-block-paragraph">The fix required pairing every hard guardrail with a <strong>persistent recovery mechanism</strong> — a tracker file on disk that serves as the authoritative source of workflow state. After compaction, the agent reads the tracker, re-establishes where it is in the process, and the guardrails reload. The tracker isn’t just a progress log. It’s a recovery mechanism that embeds its own instructions: “If you’re reading this, here’s what this file means, here’s where we’re up to, and here’s what to do next.”</p>
<figure class="wp-block-image size-large is-resized"><a href="https://andrewyager.com/wp-content/uploads/2026/02/guardrails-recovery.svg"><img src="https://andrewyager.com/wp-content/uploads/2026/02/guardrails-recovery.svg" alt="" class="wp-image-1572" style="width:839px;height:auto"/></a></figure>
<p class="wp-block-paragraph">Neither component is sufficient alone. Hard rules without recovery fail at compaction boundaries. Recovery without hard rules fails under context pressure. The pairing is what works.</p>
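<p class="wp-block-paragraph">To make the idea concrete, here is a minimal sketch of a self-describing tracker in Python. The file name, JSON schema, and field names are our invention for this example, not the skill’s actual format — the point is that the file carries its own recovery instructions alongside the state:</p>

```python
import json
from pathlib import Path

TRACKER = Path("implementation-tracker.json")

def write_tracker(phase: str, completed: list[str], next_step: str) -> None:
    """Persist workflow state to disk so it survives context compaction."""
    state = {
        # The tracker embeds its own recovery instructions, so an agent
        # reading it cold knows what the file is and what to do next.
        "_recovery_instructions": (
            "If you are reading this after compaction: this file is the "
            "authoritative workflow state. Resume at 'next_step'. Do not "
            "re-derive progress from conversation memory."
        ),
        "phase": phase,
        "completed_requirements": completed,
        "next_step": next_step,
    }
    TRACKER.write_text(json.dumps(state, indent=2))

def recover_state() -> dict:
    """Read at the start of a turn to re-establish the guardrails."""
    return json.loads(TRACKER.read_text())

write_tracker("implementation", ["REQ-1", "REQ-2"], "REQ-3: validate edge cases")
state = recover_state()
```

<p class="wp-block-paragraph">After compaction, the agent resumes from <code>next_step</code> on disk rather than from its degraded recollection of the conversation.</p>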
<h3 class="wp-block-heading">Structural indexing: let the orchestrator navigate, not comprehend</h3>
<p class="wp-block-paragraph">When you ask an agent to work with a large specification, the naive approach is to load the whole thing into the conversation. A 130,000-token spec consumes most of the available context window, leaving almost nothing for reasoning.</p>
<p class="wp-block-paragraph">We developed a pattern we call <strong>structural indexing</strong>: the main conversation loads only a lightweight index of section identifiers and file sizes. Sub-agents, dispatched to work on specific sections, read the full content directly from disk. The main conversation’s job is navigation and dispatch, not comprehension.</p>
<figure class="wp-block-image size-large is-resized"><a href="https://andrewyager.com/wp-content/uploads/2026/02/structural-indexing.svg"><img src="https://andrewyager.com/wp-content/uploads/2026/02/structural-indexing.svg" alt="" class="wp-image-1573" style="width:840px;height:auto"/></a></figure>
<p class="wp-block-paragraph">The result was dramatic — context consumption dropped by over 98% with no measurable quality degradation. The insight was architectural: the orchestrating conversation doesn’t need to understand the specification. It needs to know where things are and which agent should read what. Comprehension happens at the edges, in fresh sub-agent contexts with full attention on their assigned work.</p>
<p class="wp-block-paragraph">This principle turned out to be universal. Every time the main conversation tried to hold large volumes of content — specifications coming in, agent outputs coming back, planning artefacts, even the skill’s own instructions — the same failure mode appeared. The solution was always the same: keep the orchestrator lightweight and let it coordinate, not consume.</p>
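<p class="wp-block-paragraph">The pattern can be sketched in a few lines of Python. This is a toy stand-in — the directory layout and the <code>dispatch</code> function are hypothetical, with a plain function playing the role of a sub-agent — but it shows the division of labour: the orchestrator holds only identifiers and sizes, while content is read from disk at the edges:</p>

```python
from pathlib import Path

def build_index(spec_dir: Path) -> list[dict]:
    """Orchestrator loads only this lightweight index: section
    identifiers and sizes, never the section content itself."""
    return [
        {"section": p.stem, "path": str(p), "bytes": p.stat().st_size}
        for p in sorted(spec_dir.glob("*.md"))
    ]

def dispatch(entry: dict) -> str:
    """Stand-in for a sub-agent: in a fresh context, it reads the
    full section directly from disk and works on it."""
    content = Path(entry["path"]).read_text()
    return f"implemented {entry['section']} ({len(content)} chars)"

spec_dir = Path("spec")
spec_dir.mkdir(exist_ok=True)
(spec_dir / "auth.md").write_text("## Auth\nUsers must log in via SSO.\n")
(spec_dir / "billing.md").write_text("## Billing\nInvoices are monthly.\n")

index = build_index(spec_dir)        # a few dozen tokens per section
results = [dispatch(entry) for entry in index]
```

<p class="wp-block-paragraph">However large the specification grows, the orchestrator’s context cost stays proportional to the number of sections, not their size.</p>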
<h3 class="wp-block-heading">Anti-rationalisation: the finding nobody expected</h3>
<p class="wp-block-paragraph">This was the genuinely surprising discovery. We found that LLMs don’t just <em>forget</em> workflow steps — they actively construct locally coherent justifications for skipping them.</p>
<p class="wp-block-paragraph">An agent assigned to update a tracker file after completing a task would reason: “I know the state from the current conversation, so updating the tracker is redundant right now.” An agent told to write a plan to disk before executing would argue: “The next plan is obvious from context, so I’ll save time by executing directly.” Each justification looks reasonable in isolation. Each one silently breaks the recovery mechanism.</p>
<p class="wp-block-paragraph">We call these <strong>anti-rationalisation failures</strong>, and they’re distinct from the model ignoring instructions or adversarial jailbreaking. The model convinces itself, through plausible reasoning, that a required step doesn’t apply right now.</p>
<p class="wp-block-paragraph">The countermeasure is surprisingly specific: you have to name the exact excuses you want to prohibit. A general rule (“always update the tracker”) can be rationalised around. A rule that says “you must not skip this step, and specifically, these justifications are not valid: <em>‘I know the state from the current conversation,’</em> <em>‘this is the same session,’</em> <em>‘the next plan is obvious’</em>” — that holds. The named excuses don’t recur.</p>
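<p class="wp-block-paragraph">In practice, this means the prohibited excuses get written into the instructions verbatim. Here is a hypothetical sketch of how such a guardrail might be embedded in every dispatched prompt (the rule wording and helper are ours, not the skill’s actual text):</p>

```python
# A hypothetical guardrail block. The key move is enumerating the exact
# rationalisations that are NOT acceptable, rather than stating the rule
# in general terms that the model can reason its way around.
TRACKER_RULE = """\
After every completed task you MUST update the tracker file.
You must not skip this step. The following justifications are NOT valid:
- "I know the state from the current conversation"
- "This is the same session, so the tracker is redundant"
- "The next plan is obvious, so I'll execute directly"
If you find yourself constructing a similar justification, that is the
signal to update the tracker immediately.
"""

def build_subagent_prompt(task: str) -> str:
    """Prepend the guardrail to every dispatched task so it survives
    into fresh sub-agent contexts rather than living only in the
    orchestrator's (compactable) conversation."""
    return f"{TRACKER_RULE}\n## Task\n{task}"

prompt = build_subagent_prompt("Implement REQ-7 per spec section 4.2")
```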
<p class="wp-block-paragraph">This has broader implications for anyone designing AI workflows. General rules invite creative interpretation. Specific prohibitions close the rationalisation loop.</p>
<h3 class="wp-block-heading">Context isolation as a verification advantage</h3>
<p class="wp-block-paragraph">Here’s a reframing we’re particularly proud of. The standard view of LLM context boundaries is that they’re a limitation — agents can’t see each other’s work, so coordination is hard. We found that for verification, <strong>isolation is a feature.</strong></p>
<p class="wp-block-paragraph">In our TDD workflow, the test-writing agent reads only the specification. It never sees the implementation. The implementation agent works from a different context entirely. When their independent interpretations of the specification disagree, that disagreement surfaces ambiguities and catches drift early — before it compounds.</p>
<figure class="wp-block-image size-large is-resized"><a href="https://andrewyager.com/wp-content/uploads/2026/02/context-isolation.svg"><img src="https://andrewyager.com/wp-content/uploads/2026/02/context-isolation.svg" alt="" class="wp-image-1574" style="width:840px;height:auto"/></a></figure>
<p class="wp-block-paragraph">This is conceptually similar to N-version programming (Avizienis, 1985), where multiple independent teams develop from the same specification. The SAGA study validated the principle for LLMs specifically, finding that LLM-generated test suites have systematic blind spots mirroring the generating model’s error patterns. If your test agent sees the implementation, it inherits the implementation’s blind spots.</p>
<p class="wp-block-paragraph">The <a href="https://agilemanifesto.org">Agile Manifesto 25th Anniversary Workshop</a> concluded that test-driven development produces dramatically better results from AI coding agents by preventing them from writing tests that verify broken behaviour. We arrived at the same conclusion independently.</p>
<h3 class="wp-block-heading">Multi-skill interference: the next frontier</h3>
<p class="wp-block-paragraph">As the skill ecosystem around AI coding agents grows, a new problem is emerging. When multiple skill frameworks coexist in a single session — each with its own workflow assumptions and hard gates — they interfere in ways that neither would exhibit in isolation.</p>
<p class="wp-block-paragraph">We have identified three distinct failure patterns: <strong>workflow capture</strong>, where the most recently invoked skill overrides earlier guardrails; <strong>sub-agent context isolation</strong>, where dispatched agents don’t inherit any skill context and default to generic behaviour; and <strong>planning framework deadlock</strong>, where two skills both try to manage plan execution simultaneously. Recent research confirms this isn’t just our experience: Li (2025) found a phase transition in skill selection accuracy as library size grows, with performance dropping sharply once semantic confusability between skills reaches a threshold.</p>
<p class="wp-block-paragraph">We’ve developed pattern-specific countermeasures, but a general solution to multi-skill interference remains an open question and an active area of our research.</p>
<h2 class="wp-block-heading">What this means if you’re building with AI agents</h2>
<p class="wp-block-paragraph">The METR randomised controlled trial (Becker et al., July 2025) found that experienced open-source developers completed tasks 19% <em>slower</em> with AI assistance — while believing they were 24% faster. This perception gap is the real danger. Teams won’t self-correct toward better specification practices because they genuinely believe things are going well.</p>
<p class="wp-block-paragraph">Structural enforcement of specification discipline is necessary precisely because the humans in the loop can’t accurately assess when AI assistance is helping versus hindering.</p>
<p class="wp-block-paragraph">If you take one thing from this post, let it be this: the problem isn’t the AI. The problem is the absence of structural discipline around the AI. Every principle we’ve discovered boils down to the same insight — <strong>don’t rely on the agent’s good intentions. Build the structure that makes doing the right thing the only available path.</strong></p>
<p class="wp-block-paragraph">Treat specifications as persistent artefacts on disk, not conversation context. Use hard enforcement, not soft suggestions. Keep your orchestrator lightweight and delegate comprehension to sub-agents. Name the specific excuses you want to prevent, because general rules invite creative interpretation. And exploit context isolation for verification — don’t fight the boundaries between agents, use them.</p>
<h2 class="wp-block-heading">Try it yourself</h2>
<p class="wp-block-paragraph">The <a href="https://github.com/realworldtech/claude-implement-skill"><code>/implement</code> skill</a> is publicly available and works with Claude Code. We’d encourage you to try it on a real project and see how it changes the way your agent handles specifications.</p>
<p class="wp-block-paragraph">The principles matter more than any specific tool. <a href="https://github.com/realworldtech/claude-implement-skill"><code>/implement</code></a> is one implementation of them — freely available, and a good starting point. Your specification workflow should reflect how your team actually works; ours reflects how we work, and that’s the point.</p>
<p class="wp-block-paragraph">If you want proof the methodology holds up in practice, <a href="https://github.com/realworldtech/props">Props</a> is an open source inventory management system we built specification-first using exactly this process.</p>
<p class="wp-block-paragraph">We’re continuing this research. Multi-skill interference, sub-agent schema compliance, and partial completion detection are all open problems we’re actively working on. If you’re tackling similar challenges, we’d love to hear from you.</p>