Three ideas are enough to understand — and demystify — ChatGPT, Claude, Gemini, and all their cousins: the token, the transformer, and the context window. No formulas, just analogies that hold up.

§ 01 — The raw material

The AI doesn't read words. It reads tokens.

First surprise: when you type "Hello, how are you?" to an AI, it doesn't see your sentence, your words, or even your letters. It sees a sequence of tokens — fragments produced by an automatic split of your text.

A token isn't a whole word or a single letter: it's a fragment, somewhere in between. In English, a token is typically 3 to 4 characters, or about three quarters of a word. The word "window" may fit in a single token because it's frequent, while "windows" might split in two ("window" + "s"). A rare proper noun or a technical word can shatter into four or five pieces.

Fig. 1: A sentence, as the AI sees it
What you write: "La fenêtre de contexte est finie."
What the AI sees after automatic tokenization (8 tokens): La | fenêtre | de | contexte | est | fin | ie | .
Tokenization favors frequent fragments. The French word *finie* splits as *fin* + *ie* (French example kept for layout).
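
To make the split concrete, here is a minimal sketch of a tokenizer. It is a toy: the vocabulary is hand-picked for this illustration, and the greedy longest-match rule is a simplification. Real tokenizers learn their fragments from large amounts of text and use more elaborate algorithms.

```python
# Toy vocabulary: hypothetical, chosen by hand for this illustration only.
VOCAB = {"window", "s", "fin", "ie", " "}

def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match split; unknown characters become 1-char tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

print(tokenize("window", VOCAB))   # ['window']
print(tokenize("windows", VOCAB))  # ['window', 's']
print(tokenize("finie", VOCAB))    # ['fin', 'ie']
```

A word that is missing from the vocabulary falls apart into single characters, which is the toy version of a rare name shattering into several pieces.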

Why does this matter to you? Because everything in AI tools is measured in tokens: the bill if you pay by usage, the maximum length of a conversation, the size of documents you can analyze. When a provider advertises "200,000 tokens of context," that's roughly 150,000 words, or about a 500-page book. When you paste a document in, it gets sliced into tokens before the model looks at it.
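
The page estimate is simple arithmetic, sketched below. The three-quarters-of-a-word ratio comes from the rule of thumb above; the 300 words per page is an assumption about a typical book, so treat the results as orders of magnitude.

```python
WORDS_PER_TOKEN = 0.75   # rule of thumb for English text
WORDS_PER_PAGE = 300     # assumption: a typical book page

def tokens_to_pages(tokens: int) -> float:
    """Approximate number of book pages that a token count represents."""
    return tokens * WORDS_PER_TOKEN / WORDS_PER_PAGE

def pages_to_tokens(pages: int) -> int:
    """Approximate number of tokens needed to hold a number of pages."""
    return round(pages * WORDS_PER_PAGE / WORDS_PER_TOKEN)

print(tokens_to_pages(200_000))  # 500.0
print(pages_to_tokens(800))      # 320000
```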

§ 02 — The mechanics

One operation, repeated thousands of times: predict the next token.

Here's the most counterintuitive idea in the field, and the one that changes everything: however sophisticated it gets, a large language model fundamentally does only one thing. Given a sequence of tokens, predict the one that comes next.

No global planning. No upfront thinking about the whole answer. No hidden plan. One token at a time, in a loop that stops only when the model produces a special end-of-text token or the application reaches a length limit.

How does it do it? The architecture that performs this prediction is called a transformer. What you need to remember, without diving into the machinery, is its central principle: attention. For each token to produce, the model weighs the relative importance of every token already there. Each token looks at all the tokens before it and decides which ones matter. A kind of full re-read at every step.
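
The weighing step can be pictured with a small sketch. The scores below are invented for the illustration; in a real transformer they are computed from learned vectors, many times in parallel and across many layers. What the sketch keeps is the final move: raw scores become weights that sum to 1.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["The", " context", " window", " is"]
# Hypothetical relevance scores of each earlier token for the next prediction.
scores = [0.1, 2.0, 2.5, 0.3]

weights = softmax(scores)
for token, weight in zip(tokens, weights):
    print(f"{token!r:12} {weight:.2f}")
```

The tokens with the highest scores receive most of the weight, which is what "deciding which ones matter" means in practice.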

Fig. 2: The loop, one step at a time
Step t: the input is "La fenêtre de contexte est ?". The transformer applies attention over all tokens and outputs probabilities. Top 4 candidates: finie (0.62), limitée (0.18), large (0.07), vaste (0.04). The model picks "finie".
Step t+1: the input is "… est finie ?". The loop runs again, one token at a time, until done.
At each step, the model re-reads the whole input to pick one token. (Example continues from Fig. 1.)
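
The loop itself fits in a few lines. In this sketch, a hand-written lookup table stands in for the transformer, and the code always takes the most probable candidate. Both are simplifications: a real model looks at every token so far, and real products often pick among the top candidates with some randomness.

```python
# Hypothetical lookup table standing in for the transformer. Each entry lists
# a few top candidates for the next token; probabilities need not sum to 1.
NEXT = {
    "<start>": {"The": 1.0},
    "The": {"window": 0.7, "model": 0.3},
    "window": {"is": 0.9, "<end>": 0.1},
    "is": {"finite": 0.62, "limited": 0.18, "large": 0.07},
    "finite": {"<end>": 1.0},
}

def predict_next(tokens: list[str]) -> dict[str, float]:
    # A real model uses every token so far; this toy only uses the last one.
    return NEXT.get(tokens[-1], {"<end>": 1.0})

def generate(max_tokens: int = 20) -> list[str]:
    tokens = ["<start>"]
    for _ in range(max_tokens):
        candidates = predict_next(tokens)
        best = max(candidates, key=candidates.get)  # greedy: take the top one
        if best == "<end>":
            break
        tokens.append(best)
    return tokens[1:]

print(" ".join(generate()))  # The window is finite
```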

This mechanic has a striking practical consequence. When the AI replies to you, it doesn't know, at the moment it writes the first word, how it will end its sentence. It writes, word after word, re-reading itself at every step to decide the next one. What looks like fluid thought is a string of probabilistic micro-decisions. That's why an AI can start a confident answer and end up with a false claim — it "drifted along" with its own generation.

§ 03 — The field of view

The context window, or why your AI "forgets."

If the model only predicts the next token from what came before, it needs a horizon: the number of tokens it can "see" at once. That horizon is the context window.

This is the central notion to internalize if you use AI tools regularly. The window is everything at once: the model's zone of attention, its field of view, and its only support for information. Anything inside it can shape the response; anything outside doesn't exist for it.

This window has a maximum size, fixed when the model is built, measured in tokens. Depending on the model, that ranges from a few thousand tokens to a million or more. For current Claude models, for instance, the standard window is around 200,000 tokens, roughly a 500-page book. Beyond that, you can't add anything: you have to remove existing content to make room.

Fig. 3: The window, at a glance
A context window with a maximum capacity of about 200k tokens: the tokens already present are what the model "sees," the remaining space is available for what comes next, and nothing fits beyond the limit. The model predicts at the end of the present tokens, from all of them.
A strip of tokens with a hard limit. No memory anywhere else.

Why your AI "forgets" after a while

You've probably had this experience: in a long conversation, the assistant seems to forget what you told it earlier. Not a bug. A direct consequence of what we just saw. When the history hits the window's limit, the application driving the model has to cut: either it prunes old messages, or it replaces them with a shorter summary. Either way, the original detail is lost to the model.
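
Here is a minimal sketch of the pruning strategy, assuming a crude estimate of about four characters per token. The function names and the tiny limit are made up for the example; real applications count tokens with the model's own tokenizer and may summarize instead of cutting.

```python
def estimate_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token in English.
    return max(1, len(text) // 4)

def fit_to_window(system: str, messages: list[str], limit: int) -> list[str]:
    """Keep the system prompt plus the most recent messages that fit."""
    budget = limit - estimate_tokens(system)
    kept: list[str] = []
    for message in reversed(messages):       # newest first
        cost = estimate_tokens(message)
        if cost > budget:
            break                            # older messages are dropped
        kept.append(message)
        budget -= cost
    return list(reversed(kept))              # restore chronological order

history = [
    "User: My name is Ada.",
    "Assistant: Nice to meet you, Ada.",
    "User: What is my name?",
]
print(fit_to_window("You are a helpful assistant.", history, limit=20))
```

With this deliberately small limit, the oldest message no longer fits, so the model never sees the sentence in which the user gave their name.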

For the same reason, an 800-page PDF may simply not fit into a 200,000-token window. When that happens, the tool has to get clever: chunk the document, load only relevant excerpts, or refuse. No magic.
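
Chunking, the first of those options, can be sketched as follows. This version splits on words and allows an optional overlap between neighboring chunks; both choices are assumptions for the illustration, and real tools often split on tokens, paragraphs, or sections instead.

```python
def chunk_words(text: str, max_words: int, overlap: int = 0) -> list[str]:
    """Split text into word-based chunks, optionally overlapping."""
    if max_words <= overlap:
        raise ValueError("max_words must be larger than overlap")
    words = text.split()
    step = max_words - overlap
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), step)
    ]

document = "one two three four five six seven"
print(chunk_words(document, max_words=3))
# ['one two three', 'four five six', 'seven']
```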

§ 04 — The transformation

From a text predictor to an assistant that answers.

Here's the second counterintuitive idea in the field. A transformer, left to itself, doesn't "answer" questions. It continues text. Give it "The capital of France is" and it will most likely complete with "Paris." Give it "Hello, how are you?" and it could just as easily continue with "asked Mary as she opened the door." — because that, too, is a plausible continuation in the corpus of texts it was trained on.

For it to behave like an assistant — to answer instead of continue — you have to give it a frame. That frame is called the system prompt: a text placed before the conversation that defines who the model is, what it should do, how it should express itself, and what it should refuse. It's the first thing it "reads" on every request, and everything else is built on top of it.

The system prompt is the mechanism that powers the products you use — ChatGPT, Claude.ai, Gemini, Copilot, Le Chat. Behind every conversational interface, an instruction text turns a raw predictor into an assistant. Here's what that looks like in practice:

You are an assistant for developers, specialized in SQL databases. You answer in English. For each question, you first provide the SQL query, then a brief explanation. If a question is outside your domain, you say so.

User: How do I find all customers who ordered in March?

Assistant: SELECT DISTINCT c.* FROM customers c JOIN orders o ON c.id = o.customer_id WHERE EXTRACT(MONTH FROM o.date) = 3; The join pairs customers with their orders, the EXTRACT filter keeps the orders placed in March (of any year), and DISTINCT lists each customer only once.

Change the system prompt, you change the assistant. Here's the same model, with another identity:

You are Lea, a customer support agent for the ACME online store. You are polite, empathetic, and always formal. You only discuss orders, deliveries, and returns. For any other question, you redirect to the general contact form. You never reveal that you are an AI.

User: My order #4521 still hasn't arrived.

Assistant: Hello, I'm sorry for the delay. Could you confirm your postal code so I can check the status of order #4521?

Same brain, two personalities. Everything is continuous text — the User: and Assistant: labels are nothing more than textual markers that help the model know when its turn to continue arrives.
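
As a rough sketch, this is how an application could assemble that continuous text. The function name and the plain "User:" and "Assistant:" labels are illustrative; production systems generally rely on structured message formats and special marker tokens rather than literal labels.

```python
def render_prompt(system: str, turns: list[tuple[str, str]]) -> str:
    """Flatten a system prompt and (role, text) turns into one string."""
    lines = [system, ""]
    for role, text in turns:
        lines.append(f"{role}: {text}")
        lines.append("")
    lines.append("Assistant:")   # the model continues from here
    return "\n".join(lines)

sql_persona = "You are an assistant for developers, specialized in SQL databases."
support_persona = "You are Lea, a customer support agent for the ACME online store."
question = [("User", "My order #4521 still hasn't arrived.")]

print(render_prompt(sql_persona, question))
print("---")
print(render_prompt(support_persona, question))
```

The two prompts differ only in their first line, yet a model would continue them very differently.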

§ 05 — The missing memory

The model remembers nothing.

Here's the third idea to internalize, and it has very concrete implications for you. The transformer is stateless. Between two requests, it has no memory of what was said. None. For a conversation to feel continuous, the application talking to the model has to resend the entire conversation on every turn.

When you type "And its population?" in a discussion that was about Canada, the application rebuilds the whole history behind the scenes and sends it to the model:

You are a helpful, honest, and concise assistant. You answer in English.

User: What is the capital of Canada?

Assistant: The capital of Canada is Ottawa, in Ontario.

User: And its population?

Assistant:

Everything is there, in one long string. The model receives that block, sees that it ends with Assistant: followed by a cursor, and continues the text. Without that full reconstruction, it would have no idea what "its" refers to in the last question.
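
The sketch below shows where the memory actually lives. The model is replaced by a hypothetical stand-in, `fake_model`, whose answer depends only on the text it receives; the `Chat` class plays the application and is the only place where the history is stored.

```python
def fake_model(prompt: str) -> str:
    # Hypothetical stand-in: a pure function of the prompt, with no stored state.
    if "And its population?" in prompt:
        if "Canada" in prompt:
            return "(answers about the population, knowing the topic is Canada)"
        return "(cannot tell what 'its' refers to)"
    return "The capital of Canada is Ottawa, in Ontario."

class Chat:
    def __init__(self, system: str):
        self.system = system
        self.history: list[str] = []      # lives in the application

    def send(self, user_text: str) -> str:
        self.history.append(f"User: {user_text}")
        prompt = "\n\n".join([self.system, *self.history, "Assistant:"])
        reply = fake_model(prompt)        # the full history goes out every time
        self.history.append(f"Assistant: {reply}")
        return reply

chat = Chat("You are a helpful, honest, and concise assistant.")
chat.send("What is the capital of Canada?")
print(chat.send("And its population?"))
# Same question sent on its own, without the rebuilt history:
print(fake_model("User: And its population?\n\nAssistant:"))
```

The first call can resolve "its" because the earlier turns are in the prompt; the second cannot, because nothing else was sent.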

Fig. 4: One conversation, two requests
Turn 1: [system frame] [user: capital?] → the model predicts "Ottawa, in Ontario."
Turn 2 (contains everything before): [system frame] [user: capital?] [assistant: Ottawa…] [user: population?] → the model predicts "~1.1 million."
The history grows with every turn; the model itself has no memory between requests.
The app rebuilds the history on every call. The app is what "remembers," not the model.

This absence of internal memory has a very concrete consequence: each new exchange in a conversation re-pays the cost of everything that came before. The further the conversation goes, the more expensive each turn is in tokens, and the more the window fills up. That's why very long conversations slow down and eventually have to be restarted in a new thread.
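
A quick calculation shows how fast this adds up. The sizes below are assumptions picked to keep the numbers round; only the shape of the growth matters.

```python
SYSTEM_TOKENS = 200      # assumed size of the system prompt
USER_TOKENS = 100        # assumed size of each new user message
EXCHANGE_TOKENS = 500    # assumed size of one user message plus its reply

def input_tokens(turn: int) -> int:
    """Tokens sent on a given turn (1-based) when the full history is resent."""
    return SYSTEM_TOKENS + (turn - 1) * EXCHANGE_TOKENS + USER_TOKENS

per_turn = [input_tokens(t) for t in range(1, 11)]
print(per_turn[0], per_turn[-1])  # 300 4800
print(sum(per_turn))              # 25500
```

Under these assumptions, the tenth turn sends sixteen times more tokens than the first, and the total over ten turns is far more than ten times the first turn.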

It's also why modern products are starting to expose persistent memory features — a separate store from the conversation where the system records lasting facts about you (preferences, projects, professional context) to re-inject when relevant. It's not the model that remembers: it's the application that reminds it.
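
A persistent memory can be pictured as a small store that the application consults before each request. Everything in this sketch is hypothetical, including the stored facts and the keyword matching; actual products use more elaborate ways to decide what is relevant.

```python
# Hypothetical memory store: facts saved by the application, tagged by topic.
MEMORY = {
    "language": "The user prefers answers in English.",
    "project": "The user is migrating a billing database.",
}

def build_system_prompt(base: str, user_text: str) -> str:
    """Append stored facts whose topic word appears in the user's message."""
    lowered = user_text.lower()
    relevant = [fact for topic, fact in MEMORY.items() if topic in lowered]
    if not relevant:
        return base
    facts = "\n".join(f"- {fact}" for fact in relevant)
    return f"{base}\n\nKnown facts about the user:\n{facts}"

print(build_system_prompt("You are a helpful assistant.", "Any news on my project?"))
```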

§ 06 — Action

How an AI can act on the world.

If an AI only predicts tokens, how can it "read a file," "search the web," or "send an email"? The answer is elegant: it still does nothing other than produce text — but that text can take the shape of an action instruction that the host program will recognize and execute on its behalf.

The trick comes down to two ingredients. First, you teach the model, in its system prompt, that it has access to tools: read a file, search the web, run code, etc. Second, the application watches what the model writes. When it produces a line that looks like a tool call — something like read_file("/data/report.txt") — the application intercepts it, actually runs the operation, and injects the result into the conversation. From the model's perspective, everything stays continuous text. From the application's perspective, it's the one doing the real work.

Here's what a full cycle looks like, in continuous text:

User: Summarize the file /data/report.txt for me.

Action: read_file("/data/report.txt") Observation: The quarterly report shows a 12% revenue increase, an 8% drop in infrastructure costs, and three strategic recommendations [...4,200 tokens total...]

Reply: The report shows a 12% revenue increase, an 8% drop in costs, and three strategic recommendations for next quarter.

The model requests an action. The application performs it. The result comes back into context, the model sees it as if it had always known, and continues. This is the fundamental mechanic of modern assistants — Claude reading your Google Drive, ChatGPT searching the web, GitHub Copilot editing your code. Always the same loop: the AI asks, the app executes, the result returns to context.
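
The whole cycle can be sketched in a short program. The model is again a hypothetical stand-in, the "file system" is a dictionary so the example runs anywhere, and the `Action:` line format simply mirrors the example above. Production APIs generally use structured tool-call messages rather than pattern matching on text.

```python
import re

# Hypothetical in-memory "file system" so the example runs without real files.
FILES = {"/data/report.txt": "Revenue up 12%, infrastructure costs down 8%."}

def read_file(path: str) -> str:
    return FILES.get(path, f"error: {path} not found")

TOOLS = {"read_file": read_file}
ACTION = re.compile(r'^Action: (\w+)\("([^"]*)"\)$')

def fake_model(context: str) -> str:
    # Stand-in for the model: request the file first, then answer from it.
    if "Observation:" not in context:
        return 'Action: read_file("/data/report.txt")'
    observation = context.rsplit("Observation: ", 1)[1]
    return f"Reply: {observation}"

def run(user_text: str, max_steps: int = 5) -> str:
    context = f"User: {user_text}"
    for _ in range(max_steps):
        output = fake_model(context)
        context += "\n" + output
        match = ACTION.match(output)
        if match is None:
            return context               # plain text: the cycle is over
        name, argument = match.groups()
        tool = TOOLS.get(name)
        result = tool(argument) if tool else f"error: unknown tool {name}"
        context += f"\nObservation: {result}"   # the result re-enters the context
    return context

print(run("Summarize the file /data/report.txt for me."))
```

The stand-in model never touches the dictionary of files: it only writes text, and the `run` function does the reading on its behalf.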

The cost on the window

All of this leaves a trace in the window, and every trace costs tokens. Reading a fifty-page file means dropping fifty pages into the window. Doing ten web searches means adding ten pages of results. That's why modern agents — the ones that chain actions on their own — can saturate their window surprisingly fast. And it's also the main subject of the next article, for anyone who wants to go deeper.

§ 07 — Takeaways

Three ideas that explain everything.

If you leave this page with three things in mind, let them be these. First — the AI reads tokens, not words, and everything it sees has to fit in a fixed-size window. Second — it does only one operation, predict the next token, in a loop that re-reads the input at every step. Third — it has no memory between requests: the application around it simulates continuity by resending the history on every turn, and actually executes the tools the model asks for.

With those three ideas, you can explain why your assistant forgets after a while, why a long document might "not fit," why the same model behaves differently from one product to another, and why an agent that consults a lot of sources can become slow or imprecise. Everything you read about the topic afterward — RAG, MCP, compaction, sub-agents — will be variations on these same constraints.

★ Further reading

If you build with agents, the story continues.

This article lays the foundations. If you use Claude Code, Cursor, custom agents, or you design your own tools on top of these models, the context window becomes a resource you have to manage actively: arbitrating between system prompt, tools, history, operation results, and persistent memory.

The next article explores all of that in detail — the full toolkit of agent engineering, the phenomena that degrade quality, and the practical heuristics for staying below saturation.

Read the practitioner version