An inventory of the techniques that fill the window, the phenomena that degrade it, the heuristics to master it. And along the way, the most expensive anti-pattern in production agents.

Prerequisite. This article assumes you know what a token is, roughly how a transformer works, and why the model receives the entire history on every turn. If those notions aren't already in place, the companion article What's Really Happening When You Talk to an AI sets the stage in fifteen minutes.
§ 01 — Inventory

Everything we've invented to tame a token predictor.

Lien vers la section Everything we've invented to tame a token predictor.

A transformer on its own does just one thing: predict the next token from what it has in front of it. To make it useful in production — so it answers, remembers, acts, holds up over time — we've invented a dozen techniques. Each one fills a specific gap. Each one inhabits the context window in some way. Here's the inventory, framed by what it costs and what it unlocks.

Frame the behavior · the system prompt

Lien vers la section Frame the behavior · the system prompt

The instruction text placed at the head of every request. Defines role, tone, rules, guardrails, output format, sometimes examples. It's what turns a text predictor into an assistant. Cost: permanent, and paid on every turn. Often 5,000 to 25,000 tokens for a consumer product, more for an agent with lots of tools.

Personalize without duplicating · user preferences

Lien vers la section Personalize without duplicating · user preferences

A small extra block specific to the user, injected before the conversation — language, tone, expertise, current projects. Cost: low in tokens but high in priority — these lines weigh heavily on prediction.

Grant capabilities · tools and MCP

Lien vers la section Grant capabilities · tools and MCP

A model can't read a file, query a database, or send an email — it just produces text. The fix: declare tools the model invokes by writing a structured call (function calling, tool use), which the application executes on its behalf. The Model Context Protocol (MCP) standardizes how tools are declared and exposed, letting you plug in third-party servers (Asana, Gmail, GitLab, internal databases…) without rewriting the pipeline. Cost: every declared tool occupies the window — JSON schema, description, parameters — even when it's never called. Wire up ten MCP servers and you're paying that bill ten times over.

Teach procedures · skills

Lien vers la section Teach procedures · skills

SKILL.md files containing procedural recipes injected only when a trigger matches. Instead of bloating the system prompt with every possible recipe, you store them separately and load on demand. Cost: zero until they're activated; modest when they are. The big trap — a poorly designed skill can pull data into the window that should have been processed elsewhere. That's the subject of § 04.

Keep the thread · conversation history

Lien vers la section Keep the thread · conversation history

The model is stateless. To make a conversation feel continuous, the application reconstructs the full history on every turn. Cost: linear in the number of exchanges. By turn 40, you're paying the same price 40 times over.

Compress the old stuff · automatic summarization

Lien vers la section Compress the old stuff · automatic summarization

When you're approaching the limit, the application replaces older turns with a condensed summary produced by the model itself. Cost: compression is irreversible — a detail erased doesn't come back.

Persist across conversations · memory

Lien vers la section Persist across conversations · memory

A separate store from the history, holding durable facts (preferences, projects, professional context) reinjected into the window when relevant. Cost: low in tokens, but demands discipline — what to remember, what to forget, what to suggest.

Retrieve instead of loading everything · RAG

Lien vers la section Retrieve instead of loading everything · RAG

A document corpus (hundreds of docs, thousands of pages) doesn't fit in the window. Retrieval-Augmented Generation indexes the corpus separately, and at query time, only fetches the relevant passages for injection. The recent evolution — agentic RAG — lets the agent decide when and what to retrieve rather than imposing a frozen pre-LLM step. Cost: indexing infrastructure on the side, and answer quality depends on retrieval quality.

Cut the cost of stable prefixes · prompt caching

Lien vers la section Cut the cost of stable prefixes · prompt caching

Every request recomputes the system prompt and tool definitions — even when nothing has changed. Providers now cache the attention computation (KV cache) for stable portions. On subsequent requests, those tokens cost a fraction of their normal price and latency drops. Cost: zero in tokens — it's pure optimization — but it requires keeping the prefix identical from one request to the next, byte-for-byte.

Isolate the noise · sub-agents

Lien vers la section Isolate the noise · sub-agents

Some tasks demand reading large volumes (web, files, multiple searches) that would saturate the parent's window. Delegate to a sub-agent that has its own window, processes the noise on its side, and returns only a compact summary. Also enables parallelization. Cost: every sub-agent pays for its own system prompt and its own tools; summary compression remains irreversible. See § 06.

Compact the context · the background operation

Lien vers la section Compact the context · the background operation

Over a long agentic session — tools called, files read, sub-agents invoked — the window fills with material that's no longer relevant. Compaction prunes or summarizes peripheral portions to free up space. It's the more general idea that summarization is just one instance of. Cost: like any compression, you lose something. The challenge is to lose the right thing.

The typical allocation

Lien vers la section The typical allocation
Fig. 1Typical allocation in a production agent
┌─ WINDOW ~200,000 tokens ─┐ SYSTEM ~15-25k TOOLS defs SKILLS TOOL RESULTS the main vector of saturation HISTORY ↑ user └─ each active solution = a block of tokens you pay for all solutions share the same pool
Every technique leaves a footprint here. None are free.
§ 02 — Phenomena

Six things that happen in the window, and that we don't really control.

Lien vers la section Six things that happen in the window, and that we don't really control.

The previous solutions are levers you pull. There are also phenomena you endure — properties of the model, properties of attention, properties of the data — and you have to integrate them as constraints. These six show up in nearly every production agent. Having a name for them is the first step to handling them.

Lost in the middle
The model's attention is not uniform across the window. The beginning and end get priority; the middle is underused. It's an empirically documented architectural effect (the Lost in the Middle paper, Liu et al., 2023), softened in recent models but not gone.
→ SIGNAL · the agent ignores an instruction you know is there, but buried in the middle of a long context. Move it to the start or the end.
Context rot
The fuller the window, the more reasoning quality tends to drop, even well below the theoretical limit. An agent at 150,000 tokens isn't equivalent to the same agent at 30,000. Compaction isn't only about space — it's also about performance.
→ SIGNAL · your agent's first actions are precise, the later ones drift. Compact at 50-60% fill, not at 95%.
Attention dilution
A specific case of context rot: even when the model has the theoretical capacity to look at everything, adding irrelevant content reduces the relative weight of the relevant content. Noise doesn't just cost tokens — it dilutes signal.
→ SIGNAL · adding "just-in-case" documentation degrades performance instead of improving it. Cut what's not useful, never load it "as a precaution".
Tool soup
Past a certain number of declared tools (in practice, around fifteen to twenty depending on the model), the agent starts choosing badly — close tools confused, missing tools ignored, complex tools mis-parameterized. The bigger it gets, the slower and more wrong it gets.
→ SIGNAL · the agent invokes the wrong tool, or forgets one you know was available. Activate tools by phase, not all of them all the time.
Runaway agent
Without an explicit cap, an agent can enter a loop where every tool call produces a result that justifies another tool call. The window swells, quality drops, and the bill climbs in silence. Particularly common when the agent searches, doesn't find, and rephrases.
→ SIGNAL · a "simple" session burns ten times the tokens you expected. Set a cap on tool calls, add checkpoints, and trigger compaction or stop at fill thresholds.
Prompt injection
Any external content — web page, email, file, tool result — can carry hidden instructions that hijack the agent. The model doesn't naturally distinguish data from orders. The more powerful the agent's tools, the more serious the risk. Mandatory mental hygiene: treat third-party content as potentially hostile.
→ SIGNAL · the agent does something you didn't ask for after reading external content. Mark third-party content, restrict the tools usable after a read, require human validation for irreversible actions.
§ 03 — Heuristics

Eleven principles for arbitrating competing appetites.

Lien vers la section Eleven principles for arbitrating competing appetites.

Knowing the solutions and the phenomena isn't enough: you have to know how to compose them. Here are the heuristics I use, and that I see used in production agents. None is revolutionary on its own; their value comes from the discipline of applying them together. For each, an alarm signal that triggers it, and a case where it doesn't apply.

Measure before you optimize

Before trying to compress or rewrite, know how much each artifact actually weighs. Every modern API exposes a token count per message. Count first, target the biggest line item, then optimize.

Signal You "feel" the agent is dragging but you don't know where. Open the logs, count tokens by category (system, tools, history, results).
Except when Quick prototype to validate an idea. Don't optimize what isn't stable yet.
Precision over exhaustiveness in the system prompt

The reflex is to stuff the system prompt with examples "just in case". A long system prompt fatigues the model (see context rot) and inflates the cost of every request. Better a tight framing that delegates the details to skills loaded on demand.

Signal System prompt > 30k tokens, or with sections that are never triggered, or rewritten every sprint.
Except when The business context is so specialized that no skill can replace it (strict regulation, non-negotiable brand voice).
Only wire up the tools you need

Every declared tool occupies the window even when it's never used. Wiring up ten MCP servers "for the future" means spending thousands of tokens permanently and feeding the tool soup. Activating tools by task profile or by phase produces noticeably better agents.

Signal More than fifteen tools declared, or an agent that hesitates between two close tools.
Except when You measure and you know that no tool is superfluous. In that case, document the reason for each.
Never load a raw file when you can process it with code

This is the single most important principle — § 04 is dedicated to it. Asking the model to "look at" a 100,000-line CSV or a fifty-page PDF is the most common cause of saturation. Giving the model a way to write code that operates on the data and only bring back the result is the fundamental pivot.

Signal A single tool call brings back more than 5,000 tokens of context.
Except when The file is small (< 2k tokens) and the model needs to grasp its entirety (a nuanced re-read of a short text, for example).
Put the essentials at the edges

Given the lost in the middle effect, critical instructions belong at the start or end of the window. The business rule you don't want ignored? At the end of the system prompt. The most important immediate instruction? In the last user message.

Signal A documented instruction isn't being followed. Before deciding "the model is dumb", check its position in the window.
Except when You have little content and everything fits in a short horizon. The rule only kicks in beyond a few thousand tokens.
Stabilize the prefix to enable the KV cache

Prompt caching only works if the leading portion is identical from one request to the next, byte-for-byte. Putting today's date or a session ID right at the top invalidates the cache on every turn. Keeping the prefix immutable and placing variable elements further down is a free optimization — typically 80-90% reduction on stable-prefix cost, and latency cut by two or three times.

Signal Your Anthropic / OpenAI calls don't show a cache hit even though the system prompt is "identical".
Except when Your requests are rare or irregular — the cache has a limited lifetime (5 min on Anthropic by default).
Compact early, not in a panic

Waiting for the window to be full to compact means compacting in a hurry — and badly. Well-built agents trigger compaction by threshold (60% fill is a good starting point), with a deliberate strategy: what to summarize, what to prune, what to keep verbatim.

Signal Compaction kicks in at 95%, or worse, doesn't exist and long sessions crash.
Except when You're in a session that's short by construction (single-turn, or with a hard cap on calls).
Delegate noisy work to sub-agents

Any task that involves reading a lot to produce a little — web exploration, large file reading, multi-source research — is a natural candidate for a sub-agent. The parent keeps its window light; the sub-agent absorbs the noise in its own and only returns a summary.

Signal The main agent's context is 70% filled with search results or raw content.
Except when The task requires the parent to see the detail (audit, traceability, multi-step reasoning over specific items).
Treat all external content as hostile

A web page, an email, a tool result are data — and they can carry hidden instructions (see prompt injection). For agents with sensitive tools (sending emails, accessing internal systems, executing code), this is non-negotiable. Mark third-party content, restrict the tools usable after a read, require human validation for irreversible actions — disciplines, not options.

Signal Your agent has access to email, a browser, or external data AND can execute side-effecting actions.
Except when The agent is purely read-only and has no side-effecting tools. The risk becomes theoretical.
Remember what lasts, not what passes

Persistent memory is precious but treacherous. You put durable facts in there (preferences, ongoing projects, professional context), not micro-details from a conversation. Useful rule: if the information isn't relevant in at least three future conversations, it has no business being in memory.

Signal Memory contains "the user said X on Tuesday" for X's that will never come back. Or worse, accumulated contradictions.
Except when It's explicitly a note-taking or personal-journal agent — granular retention is then the feature.
Iterate with evals, not by gut feel

Context optimization is like performance tuning: intuition is often wrong. Building a few reproducible tests — here's a question, here's the expected answer — and measuring the impact of each change prevents silent regressions. Adding a tool or a skill without measuring degrades things surprisingly fast.

Signal You add a feature and another behavior, with no apparent connection, becomes unstable.
Except when You're in pure exploration and performance isn't yet a criterion. Once in production, no more excuses.
§ 04 — The anti-pattern

Skills that read vs. skills that execute.

Lien vers la section Skills that read vs. skills that execute.

This is the most poorly understood distinction in agent engineering. A skill isn't a place where you drop data for the model to contemplate: it's an instruction manual for operating on it outside the context. It's also the optimization with the most spectacular gains — often two orders of magnitude on token consumption.

↯ Anti-pattern

The skill that reads

Loads the raw file into the window, asks the model to look at everything and then summarize it. Expensive, slow, fragile, capped by file size, and subject to context rot.

✓ Good pattern

The skill that executes

Teaches the model to write code that operates on the data — analyze, filter, aggregate, validate. Only the compact result comes back into context. Code sees the bytes, the model sees the aggregate.

The real cost, in numbers

Lien vers la section The real cost, in numbers

Concrete case: "How many transactions over $1,000 are there in this 100,000-row CSV?" The file is roughly 8 MB of text, which is roughly 2 million tokens. Let's compare the two trajectories:

A · The skill that reads (anti-pattern)
itemtokens
→ Attempt to load the whole thing2,000,000
→ Window limit exceeded (200k)failure
→ Fallback strategy: chunking + summaries~180,000
→ Result: approximation, no exact countimprecise
TOTAL · 1 approximate answer~180,000 tk
B · The skill that executes (good pattern)
itemtokens
→ Skill loaded into context~400
→ Model writes a Python script~200
→ Script reads the CSV outside context (pandas)0
→ Script output in context: "47,322"~5
TOTAL · 1 exact answer~605 tk

Ratio ~300×. And along the way: answer B is exact whereas A is necessarily approximate. The good pattern is faster, cheaper, and more precise. It's not a tradeoff — it's just a better architecture.

Fig. 2Two data trajectories
A · DIRECT READ file 8 MB everything flows in WINDOW SATURATED the model 'looks at' everything summary approximate B · CODE EXECUTION file 8 MB stays on disk SKILL.md → writes code exec outside the window data exact code sees the bytes; the model sees the result A · ~180,000 tk · imprecise · capped B · ~600 tk · exact · scalable
A well-designed skill keeps the data on disk and only brings back the result.

This idea — code execution as context compression — is the most cost-effective pattern in contemporary agent engineering. When you design a skill, always ask yourself: does the model need to see the data, or just the result of processing it? The answer is almost always "the result".

§ 05 — Audit

How to measure what's really happening in your window.

Lien vers la section How to measure what's really happening in your window.

The rest of this article assumes you know what your agent is consuming. Most teams I meet only have an intuition. The audit isn't complicated; it just demands you do it once and instrument cleanly.

The four base metrics

Lien vers la section The four base metrics

For every model call, log four numbers. Total input tokens — the full size sent to the model. Output tokens — what the model generated. Cached tokens (cache hit) — what cost the fraction. Tokens billed at full price — the difference. Every serious API (Anthropic, OpenAI, Google) exposes these counters in the response; you just need to capture and aggregate them.

The breakdown by category

Lien vers la section The breakdown by category

Once the totals are known, split the input. How much for the system prompt? How much for tool definitions? How much for history? How much for tool results in the current session? How much for loaded skills? At this stage, most production agents discover that tool results devour 40-60% of the window and nobody knew. That's typically where you should pull.

The health indicators

Lien vers la section The health indicators

Three indicators are worth tracking over time. The cache hit rate — under 70%, your prefix isn't stable. The average window fill at end of session — above 70%, you're in context rot territory. The average number of tool calls per session — if it drifts upward without quality gains, you have a runaway agent in formation.

Practical tools

Lien vers la section Practical tools

At minimum, a middleware that captures API counters and writes them to a database or structured log file. To go further: providers offer dashboards (Anthropic Console, OpenAI Usage), giving a global view but without the per-category breakdown. For Claude Code specifically, the /context command displays the current window's breakdown in real time — it's the most valuable read to learn. More on this in § 07.

§ 06 — Architecture

Sub-agents: isolated windows.

Lien vers la section Sub-agents: isolated windows.

When a parent delegates to a sub-agent, it opens a clean window for it. The sub-agent absorbs the noise — raw reading, searches, exploration — then returns only a compact summary. The parent receives a telegram, not a flood. It's the pattern that lets an orchestration agent handle problems that exceed its own window by a wide margin.

Fig. 3Parallel delegation
PARENT main window lightweight, orchestrates delegates in parallel SUB-AGENT 1 absorbs the noise file · web · search SUB-AGENT 2 isolated window own context SUB-AGENT 3 parallelizable independent COMPACT SUMMARIES ⚠ compression is irreversible
Each sub-agent opens its own window, handles the noise, returns a telegram.

Advantages

Lien vers la section Advantages

Isolation: a sub-agent that saturates its own window doesn't affect the parent. Parallelization: several sub-agents can work simultaneously, which a monolithic agent's single window forbids. Specialization: each sub-agent can have its own system prompt and its own tools, finely tuned to its task.

Limit

Lien vers la section Limit

Compression is irreversible. If the sub-agent omits a detail in its summary, the parent has no way to recover it — short of re-running a delegation, which costs a full new cycle. That's why sub-agents demand particular care in defining their return contract: what must it surface, even if it lengthens the summary?

§ 07 — Practical focus

How this plays out in Claude Code and friends.

Lien vers la section How this plays out in Claude Code and friends.

You're probably using Claude Code, Cursor, Cline, or a homegrown agent built on the Anthropic or OpenAI API. Here's how the previous principles show up in those tools — and where to look to diagnose them.

Read the window in real time

Lien vers la section Read the window in real time

In Claude Code, the /context command displays the exact breakdown of your current window: system prompt, MCP tools, loaded skills, history, tool results. It's the most useful read to learn. Run it regularly during a long session; you'll quickly identify which item is eating space. Most of the time, it's tool results — typically Reads of large files or Bashes returning bulky JSON.

Automatic compaction

Lien vers la section Automatic compaction

Claude Code triggers automatic compaction when the window approaches its limit. Older turns are replaced by a summary. You can also trigger it manually with /compact, adding instructions on what compaction must preserve ("keep the list of files I modified, the Bash commands run and their result"). Compacting early and with explicit instructions almost always gives better results than letting auto-compaction decide alone at the edge of the cliff.

MCP arbitration

Lien vers la section MCP arbitration

When you wire up several MCP servers (GitHub, Linear, database, Sentry, etc.), each adds its own tool definitions permanently. Measure the cost: /context gives it to you. If you see 20-30k tokens in MCP tools that only get used occasionally, consider activating servers per project via configuration rather than globally. It's one of the highest-yield levers on Claude Code.

Skills, in practice

Lien vers la section Skills, in practice

SKILL.md files aren't loaded by default: they're described in the system prompt as an index, and the agent opens them via their view tool when a trigger matches. This design is the direct application of § 04: the procedure only occupies the window on demand, and only when it serves. When you write your own skills, follow the same principle: short instructions, references to code, never raw data packaged into the markdown.

The Task sub-agent

Lien vers la section The Task sub-agent

Claude Code exposes a Task tool that launches a sub-agent with its own context. Excellent application of § 06: delegate multi-file searches, large-directory exploration, code audits to a sub-agent. You'll get back a summary instead of flooding your main context.

Cursor, Cline, Copilot, and the others

Lien vers la section Cursor, Cline, Copilot, and the others

The principles are the same, the instrumentation differs. Cursor exposes less visibility into the window's composition; you often have to go through the API logs. Cline and the open-source agents based on the Model Context Protocol generally expose more detail. Whatever the tool, the question to ask stays the same: what's filling my window, and why?

§ 08 — State of play

Where we are, in May 2026.

Lien vers la section Where we are, in May 2026.

The terrain shifts fast. This section is dated for that reason: what's true at the time of publication may not be six months from now. A few notable trends you can fold into your engineering thinking.

Standard windows have stalled around 200k, but experimental offerings at 1M tokens exist (Claude Sonnet in beta, Gemini for a while now). The per-token cost in "long context" mode stays meaningfully higher, and degradation at large window is more pronounced — in other words, the "1M" option is useful for singular cases (a large document to process at once) but remains a poor default reflex.

The KV cache has become a universal given. Anthropic, OpenAI, and Google all expose prompt caching mechanisms with explicit pricing. If you're not using them, you're leaving money on the table. Stable-prefix discipline is no longer an advanced optimization: it's the baseline expectation.

MCP has become the de facto standard for declaring third-party tools. The ecosystem now includes hundreds of public servers, which is both an opportunity (huge capabilities accessible quickly) and a trap (the tool soup temptation). The 2026 challenge is less about plugging in and more about judiciously choosing what to plug in.

Skills have left the margins. Anthropic popularized them in 2025 with Claude Code; the pattern has spread. Agents without an explicit skill system tend to accumulate everything in the system prompt — meaning they pay permanently for what they could load on demand.

The "code execution as context compression" pattern — the idea from § 04 — has become a topic in the agent engineering community and the subject of technical articles from Anthropic and others. If you haven't applied it in your architecture yet, it's probably the highest-priority item for your next iteration.

Systematic evaluation remains under-practiced. It's the discipline I see least often in place at teams building agents; and paradoxically it's the one that lets you apply all the others with confidence. Things are moving — tools like Promptfoo, Inspect, and Anthropic's evals are spreading — but the gap between teams that evaluate and teams that don't remains considerable.

★ Further reading