Context Is Finite. Program Accordingly.
An inventory of the techniques that fill the window, the phenomena that degrade it, the heuristics to master it. And along the way, the most expensive anti-pattern in production agents.
Everything we've invented to tame a token predictor.
A transformer on its own does just one thing: predict the next token from what it has in front of it. To make it useful in production — so it answers, remembers, acts, holds up over time — we've invented a dozen techniques. Each one fills a specific gap. Each one inhabits the context window in some way. Here's the inventory, framed by what it costs and what it unlocks.
Frame the behavior · the system prompt
The instruction text placed at the head of every request. Defines role, tone, rules, guardrails, output format, sometimes examples. It's what turns a text predictor into an assistant. Cost: permanent, and paid on every turn. Often 5,000 to 25,000 tokens for a consumer product, more for an agent with lots of tools.
Personalize without duplicating · user preferences
A small extra block specific to the user, injected before the conversation — language, tone, expertise, current projects. Cost: low in tokens but high in priority — these lines weigh heavily on prediction.
Grant capabilities · tools and MCP
A model can't read a file, query a database, or send an email — it just produces text. The fix: declare tools the model invokes by writing a structured call (function calling, tool use), which the application executes on its behalf. The Model Context Protocol (MCP) standardizes how tools are declared and exposed, letting you plug in third-party servers (Asana, Gmail, GitLab, internal databases…) without rewriting the pipeline. Cost: every declared tool occupies the window — JSON schema, description, parameters — even when it's never called. Wire up ten MCP servers and you're paying that bill ten times over.
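To make the cost concrete, here is what a single declared tool looks like on the wire, in the shape the Anthropic Messages API expects. The tool itself (search_invoices) is hypothetical and the model name is illustrative; the point is that this whole schema rides along with every request, whether or not the tool is ever called:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# One hypothetical tool. Name, description, and schema all occupy the window
# on every single request, called or not.
tools = [{
    "name": "search_invoices",
    "description": "Search invoices by customer and date range.",
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string"},
            "since": {"type": "string", "description": "ISO 8601 date"},
        },
        "required": ["customer_id"],
    },
}]

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model name
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Any ACME invoices since March?"}],
)
```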
Teach procedures · skills
SKILL.md files containing procedural recipes injected only when a trigger matches. Instead of bloating the system prompt with every possible recipe, you store them separately and load on demand. Cost: zero until they're activated; modest when they are. The big trap — a poorly designed skill can pull data into the window that should have been processed elsewhere. That's the subject of § 04.
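A minimal sketch of the shape such a file can take — frontmatter that serves as the trigger index, then short procedural instructions (the skill name and wording here are illustrative). Note that it teaches a procedure and contains no data:

```markdown
---
name: csv-analysis
description: Use when the user asks questions about a CSV file.
---

# CSV analysis

1. Never load the full file into context. Inspect the header and a few rows.
2. Write a script that computes the requested aggregate.
3. Return the script's output, not the data it read.
```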
Keep the thread · conversation history
The model is stateless. To make a conversation feel continuous, the application reconstructs the full history on every turn. Cost: each request grows linearly with the length of the conversation. By turn 40, you've re-sent the first exchange forty times.
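A minimal sketch of that reconstruction, assuming the Anthropic SDK (model name illustrative). The application owns the state; the model sees everything again on each call:

```python
import anthropic

client = anthropic.Anthropic()
SYSTEM_PROMPT = "You are a helpful assistant."  # placeholder framing
history: list[dict] = []                        # the application owns the state

def ask(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    response = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model name
        max_tokens=1024,
        system=SYSTEM_PROMPT,       # paid again on every turn
        messages=history,           # the whole conversation, re-sent every time
    )
    reply = response.content[0].text
    history.append({"role": "assistant", "content": reply})
    return reply
```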
Compress the old stuff · automatic summarization
When you're approaching the limit, the application replaces older turns with a condensed summary produced by the model itself. Cost: compression is irreversible — a detail erased doesn't come back.
Persist across conversations · memory
A separate store from the history, holding durable facts (preferences, projects, professional context) reinjected into the window when relevant. Cost: low in tokens, but demands discipline — what to remember, what to forget, what to suggest.
Retrieve instead of loading everything · RAG
A document corpus (hundreds of docs, thousands of pages) doesn't fit in the window. Retrieval-Augmented Generation indexes the corpus separately and, at query time, fetches only the relevant passages for injection. The recent evolution — agentic RAG — lets the agent decide when and what to retrieve rather than imposing a frozen pre-LLM step. Cost: indexing infrastructure on the side, and answer quality depends on retrieval quality.
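A hedged sketch of the query-time half, assuming an embed function and a vector index built at indexing time (both hypothetical names with a hypothetical API):

```python
def answer_with_rag(client, index, embed, question: str) -> str:
    # Retrieve: a few relevant passages instead of the whole corpus.
    query_vector = embed(question)
    passages = index.search(query_vector, top_k=5)  # hypothetical index API
    excerpts = "\n\n".join(p.text for p in passages)

    # Generate: the window carries kilobytes of excerpts, not the corpus.
    response = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model name
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Answer from these excerpts only:\n{excerpts}\n\nQuestion: {question}",
        }],
    )
    return response.content[0].text
```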
Cut the cost of stable prefixes · prompt caching
Every request recomputes the system prompt and tool definitions — even when nothing has changed. Providers now cache the attention computation (KV cache) for stable portions. On subsequent requests, those tokens cost a fraction of their normal price and latency drops. Cost: zero in tokens — it's pure optimization — but it requires keeping the prefix identical from one request to the next, byte-for-byte.
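With the Anthropic API, caching is opt-in per content block: mark the end of the stable prefix with cache_control and keep everything before it byte-identical between calls. A sketch (client as above, system prompt a placeholder):

```python
import anthropic

client = anthropic.Anthropic()
LONG_STABLE_SYSTEM_PROMPT = "..."  # placeholder for your real system prompt

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model name
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_STABLE_SYSTEM_PROMPT,       # must not vary between requests
        "cache_control": {"type": "ephemeral"},  # cache everything up to here
    }],
    messages=[{"role": "user", "content": "First question of the session"}],
)
# On subsequent calls, response.usage.cache_read_input_tokens > 0 means a hit.
```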
Isolate the noise · sub-agents
Some tasks demand reading large volumes (web, files, multiple searches) that would saturate the parent's window. Delegate to a sub-agent that has its own window, processes the noise on its side, and returns only a compact summary. Also enables parallelization. Cost: every sub-agent pays for its own system prompt and its own tools; summary compression remains irreversible. See § 06.
Compact the context · the background operation
Over a long agentic session — tools called, files read, sub-agents invoked — the window fills with material that's no longer relevant. Compaction prunes or summarizes peripheral portions to free up space; summarization is just one instance of this more general operation. Cost: like any compression, you lose something. The challenge is to lose the right thing.
The typical allocation
Six things that happen in the window, and that we don't really control.
The previous solutions are levers you pull. There are also phenomena you endure — properties of the model, properties of attention, properties of the data — and you have to integrate them as constraints. These six show up in nearly every production agent. Having a name for them is the first step to handling them.
Eleven principles for arbitrating competing appetites.
Knowing the solutions and the phenomena isn't enough: you have to know how to compose them. Here are the heuristics I use, and that I see used in production agents. None is revolutionary on its own; their value comes from the discipline of applying them together. For each, an alarm signal that triggers it, and a case where it doesn't apply.
Before trying to compress or rewrite, know how much each artifact actually weighs. Every modern API exposes a token count per message. Count first, target the biggest line item, then optimize.
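The Anthropic SDK, for instance, exposes a token-counting endpoint that prices a request without running it; one call per artifact gives you the weight of each line item. A sketch, assuming the client and tools list from the inventory example above:

```python
# Weigh the pieces separately: same request shape, different payloads.
base = client.messages.count_tokens(
    model="claude-sonnet-4-5",  # illustrative model name
    messages=[{"role": "user", "content": "ping"}],
)
with_tools = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    tools=tools,  # the declarations from the inventory example
    messages=[{"role": "user", "content": "ping"}],
)
print("tool definitions cost:", with_tools.input_tokens - base.input_tokens)
```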
The reflex is to stuff the system prompt with examples "just in case". A long system prompt fatigues the model (see context rot) and inflates the cost of every request. Better a tight framing that delegates the details to skills loaded on demand.
Every declared tool occupies the window even when it's never used. Wiring up ten MCP servers "for the future" means spending thousands of tokens permanently and feeding the tool soup. Activating tools by task profile or by phase produces noticeably better agents.
This is the single most important principle — § 04 is dedicated to it. Asking the model to "look at" a 100,000-line CSV or a fifty-page PDF is the most common cause of saturation. Giving the model a way to write code that operates on the data and brings back only the result is the fundamental pivot.
Given the lost in the middle effect, critical instructions belong at the start or end of the window. The business rule you don't want ignored? At the end of the system prompt. The most important immediate instruction? In the last user message.
Prompt caching only works if the leading portion is identical from one request to the next, byte-for-byte. Putting today's date or a session ID right at the top invalidates the cache on every turn. Keeping the prefix immutable and placing variable elements further down is a free optimization — typically an 80-90% reduction on stable-prefix cost, with latency cut by a factor of two or three.
Waiting for the window to be full to compact means compacting in a hurry — and badly. Well-built agents trigger compaction by threshold (60% fill is a good starting point), with a deliberate strategy: what to summarize, what to prune, what to keep verbatim.
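A sketch of that trigger, with a deliberate strategy baked in: keep the recent turns verbatim, summarize the rest. The 60% threshold is the starting point named above; the summarize helper is a hypothetical model call carrying your explicit preservation instructions:

```python
WINDOW_SIZE = 200_000   # tokens; adjust to your model
THRESHOLD = 0.60        # compact early, not at the edge of the cliff
KEEP_RECENT = 10        # most recent messages survive verbatim

def maybe_compact(history: list[dict], used_tokens: int) -> list[dict]:
    if used_tokens < THRESHOLD * WINDOW_SIZE or len(history) <= KEEP_RECENT:
        return history
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    # summarize() is a hypothetical helper: one model call instructed on
    # what to summarize, what to prune, what to keep verbatim.
    summary = summarize(old)
    compacted = {"role": "user", "content": f"[Summary of earlier turns]\n{summary}"}
    return [compacted] + recent
```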
Any task that involves reading a lot to produce a little — web exploration, large file reading, multi-source research — is a natural candidate for a sub-agent. The parent keeps its window light; the sub-agent absorbs the noise in its own and only returns a summary.
A web page, an email, a tool result — these are data, and data can carry hidden instructions (see prompt injection). For agents with sensitive tools (sending emails, accessing internal systems, executing code), this is non-negotiable. Mark third-party content, restrict the tools usable after a read, require human validation for irreversible actions — disciplines, not options.
Persistent memory is precious but treacherous. You put durable facts in there (preferences, ongoing projects, professional context), not micro-details from a conversation. Useful rule: if the information isn't relevant in at least three future conversations, it has no business being in memory.
Context optimization is like performance tuning: intuition is often wrong. Building a few reproducible tests — here's a question, here's the expected answer — and measuring the impact of each change prevents silent regressions. Adding a tool or a skill without measuring degrades things surprisingly fast.
Skills that read vs. skills that execute.
This is the most poorly understood distinction in agent engineering. A skill isn't a place where you drop data for the model to contemplate: it's an instruction manual for operating on it outside the context. It's also the optimization with the most spectacular gains — often two orders of magnitude on token consumption.
The skill that reads
Loads the raw file into the window, asks the model to look at everything and then summarize it. Expensive, slow, fragile, capped by file size, and subject to context rot.
The skill that executes
Teaches the model to write code that operates on the data — analyze, filter, aggregate, validate. Only the compact result comes back into context. Code sees the bytes, the model sees the aggregate.
The real cost, in numbers
Concrete case: "How many transactions over $1,000 are there in this 100,000-row CSV?" The file is roughly 8 MB of text — about 2 million tokens. Let's compare the two trajectories:
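Trajectory A loads the file into the window: around 2 million input tokens before the model says a word, and an answer that depends on the model eyeballing 100,000 rows. Trajectory B has the model write a few lines of code; only the result re-enters the window. A sketch of that code (the amount column name is an assumption):

```python
import csv

def count_large_transactions(path: str, threshold: float = 1_000.0) -> int:
    # The code reads all 100,000 rows; the context window never sees them.
    with open(path, newline="") as f:
        return sum(1 for row in csv.DictReader(f)
                   if float(row["amount"]) > threshold)

print(count_large_transactions("transactions.csv"))
# A single number comes back into context: a handful of tokens, not ~2 million.
```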
The ratio is roughly 300×. And along the way, answer B is exact whereas A is necessarily approximate. The good pattern is faster, cheaper, and more precise. It's not a tradeoff — it's just a better architecture.
This idea — code execution as context compression — is the most cost-effective pattern in contemporary agent engineering. When you design a skill, always ask yourself: does the model need to see the data, or just the result of processing it? The answer is almost always "the result".
How to measure what's really happening in your window.
The rest of this article assumes you know what your agent is consuming. Most teams I meet only have an intuition. The audit isn't complicated; it just demands you do it once and instrument cleanly.
The four base metrics
For every model call, log four numbers. Total input tokens — the full size sent to the model. Output tokens — what the model generated. Cached tokens (cache hits) — the portion billed at the reduced rate. Tokens billed at full price — the difference. Every serious API (Anthropic, OpenAI, Google) exposes these counters in the response; you just need to capture and aggregate them.
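A minimal capture middleware, using the Anthropic SDK's usage fields (other providers expose equivalents under different names). One Anthropic convention to know: input_tokens excludes cached tokens, which are reported separately:

```python
import json
import time

def logged_call(client, **kwargs):
    """Call the model, then append the four base counters to a JSONL log."""
    response = client.messages.create(**kwargs)
    u = response.usage
    cache_read = getattr(u, "cache_read_input_tokens", 0) or 0
    cache_write = getattr(u, "cache_creation_input_tokens", 0) or 0
    record = {
        "ts": time.time(),
        "total_input_tokens": u.input_tokens + cache_read + cache_write,
        "output_tokens": u.output_tokens,
        "cached_tokens": cache_read,  # billed at the reduced rate
        # Cache writes can bill slightly above the base rate at some providers.
        "full_price_tokens": u.input_tokens + cache_write,
    }
    with open("token_log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return response
```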
The breakdown by category
Once the totals are known, split the input. How much for the system prompt? How much for tool definitions? How much for history? How much for tool results in the current session? How much for loaded skills? At this stage, most production agents discover that tool results devour 40-60% of the window and nobody knew. That's typically the first lever to pull.
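Once each category has a number — from count_tokens calls or from your logs — the breakdown is one loop. The figures below are purely illustrative, chosen only to match the pattern described above:

```python
breakdown = {                  # illustrative numbers, not measurements
    "system_prompt":    8_000,
    "tool_definitions": 22_000,
    "history":          35_000,
    "tool_results":     95_000,
    "loaded_skills":    4_000,
}
total = sum(breakdown.values())
for name, tokens in sorted(breakdown.items(), key=lambda kv: -kv[1]):
    print(f"{name:<16} {tokens:>8,}  {tokens / total:6.1%}")
```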
The health indicators
Three indicators are worth tracking over time. The cache hit rate — under 70%, your prefix isn't stable. The average window fill at end of session — above 70%, you're in context rot territory. The average number of tool calls per session — if it drifts upward without quality gains, you have a runaway agent in the making.
Practical tools
At minimum, a middleware that captures API counters and writes them to a database or structured log file. To go further: providers offer dashboards (Anthropic Console, OpenAI Usage) giving a global view, but without the per-category breakdown. For Claude Code specifically, the /context command displays the current window's breakdown in real time — the single most instructive thing to watch. More on this in § 07.
Sub-agents: isolated windows.
When a parent delegates to a sub-agent, it opens a clean window for it. The sub-agent absorbs the noise — raw reading, searches, exploration — then returns only a compact summary. The parent receives a telegram, not a flood. It's the pattern that lets an orchestration agent handle problems that exceed its own window by a wide margin.
Advantages
Isolation: a sub-agent that saturates its own window doesn't affect the parent. Parallelization: several sub-agents can work simultaneously, which a monolithic agent's single window forbids. Specialization: each sub-agent can have its own system prompt and its own tools, finely tuned to its task.
Limit
Compression is irreversible. If the sub-agent omits a detail in its summary, the parent has no way to recover it — short of re-running a delegation, which costs a full new cycle. That's why sub-agents demand particular care in defining their return contract: what must it surface, even if it lengthens the summary?
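One way to impose that discipline is to give the sub-agent an explicit, typed return contract instead of asking for a free-form summary. A sketch for a research sub-agent; the field names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class ResearchReport:
    """What a research sub-agent must surface, even if it lengthens the summary."""
    answer: str                                              # the compact conclusion
    sources: list[str] = field(default_factory=list)         # enables re-delegation
    caveats: list[str] = field(default_factory=list)         # what it was unsure of
    files_touched: list[str] = field(default_factory=list)   # verbatim, never summarized
```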
How this plays out in Claude Code and friends.
You're probably using Claude Code, Cursor, Cline, or a homegrown agent built on the Anthropic or OpenAI API. Here's how the previous principles show up in those tools — and where to look to diagnose them.
Read the window in real time
In Claude Code, the /context command displays the exact breakdown of your current window: system prompt, MCP tools, loaded skills, history, tool results. It's the most instructive view you have. Run it regularly during a long session; you'll quickly identify which item is eating space. Most of the time, it's tool results — typically Reads of large files or Bashes returning bulky JSON.
Automatic compaction
Claude Code triggers automatic compaction when the window approaches its limit. Older turns are replaced by a summary. You can also trigger it manually with /compact, adding instructions on what compaction must preserve ("keep the list of files I modified, the Bash commands run and their result"). Compacting early and with explicit instructions almost always gives better results than letting auto-compaction decide alone at the edge of the cliff.
MCP arbitration
When you wire up several MCP servers (GitHub, Linear, database, Sentry, etc.), each adds its own tool definitions permanently. Measure the cost: /context gives it to you. If you see 20-30k tokens in MCP tools that only get used occasionally, consider activating servers per project via configuration rather than globally. It's one of the highest-yield levers on Claude Code.
Skills, in practice
SKILL.md files aren't loaded by default: they're described in the system prompt as an index, and the agent opens them via their view tool when a trigger matches. This design is the direct application of § 04: the procedure only occupies the window on demand, and only when it serves. When you write your own skills, follow the same principle: short instructions, references to code, never raw data packaged into the markdown.
The Task sub-agent
Claude Code exposes a Task tool that launches a sub-agent with its own context. Excellent application of § 06: delegate multi-file searches, large-directory exploration, code audits to a sub-agent. You'll get back a summary instead of flooding your main context.
Cursor, Cline, Copilot, and the others
The principles are the same, the instrumentation differs. Cursor exposes less visibility into the window's composition; you often have to go through the API logs. Cline and the open-source agents based on the Model Context Protocol generally expose more detail. Whatever the tool, the question to ask stays the same: what's filling my window, and why?
Where we are, in May 2026.
The terrain shifts fast. This section is dated for that reason: what's true at the time of publication may not be six months from now. A few notable trends you can fold into your engineering thinking.
Standard windows have stalled around 200k tokens, but experimental offerings at 1M tokens exist (Claude Sonnet in beta, Gemini for a while now). The per-token cost in "long context" mode stays meaningfully higher, and degradation at large window sizes is more pronounced — in other words, the 1M option is useful for specific cases (a large document processed in one pass) but remains a poor default reflex.
The KV cache has become a universal given. Anthropic, OpenAI, and Google all expose prompt caching mechanisms with explicit pricing. If you're not using them, you're leaving money on the table. Stable-prefix discipline is no longer an advanced optimization: it's the baseline expectation.
MCP has become the de facto standard for declaring third-party tools. The ecosystem now includes hundreds of public servers, which is both an opportunity (huge capabilities accessible quickly) and a trap (the tool soup temptation). The 2026 challenge is less about plugging in and more about judiciously choosing what to plug in.
Skills have left the margins. Anthropic popularized them in 2025 with Claude Code; the pattern has spread. Agents without an explicit skill system tend to accumulate everything in the system prompt — meaning they pay permanently for what they could load on demand.
The "code execution as context compression" pattern — the idea from § 04 — has become a topic in the agent engineering community and the subject of technical articles from Anthropic and others. If you haven't applied it in your architecture yet, it's probably the highest-priority item for your next iteration.
Systematic evaluation remains under-practiced. It's the discipline I see least often in place among teams building agents, and paradoxically it's the one that lets you apply all the others with confidence. Things are moving — tools like Promptfoo, Inspect, and Anthropic's evals are spreading — but the gap between teams that evaluate and teams that don't remains considerable.
- Anthropic · Effective context engineering for AI agents. The founding article on the discipline, by Anthropic's applied AI team.
- Liu et al. · Lost in the Middle (2023). The paper that empirically documented attention's non-uniformity across the window.
- Model Context Protocol · specification. The open standard for declaring and exposing tools to agents.
- Anthropic · Claude Code documentation. The reference for the /context and /compact commands, and the skill system.
- Anthropic · Prompt caching. How to enable the KV cache and structure your prefix to get the most out of it.
- Companion article · What's Really Happening When You Talk to an AI. The conceptual foundations, for sharing with a less technical audience.