What's Really Happening When You Talk to an AI
Three ideas are enough to understand — and demystify — ChatGPT, Claude, Gemini, and all their cousins: the token, the transformer, and the context window. No formulas, just analogies that hold up.
The AI doesn't read words. It reads tokens.
First surprise: when you type "Hello, how are you?" to an AI, it doesn't see your sentence, your words, or even your letters. It sees a sequence of tokens — fragments produced by an automatic split of your text.
A token isn't a whole word or a single letter: it's a fragment, somewhere in between. In English, a token is typically 3 to 4 characters, or about three quarters of a word. The word "window" may fit in a single token because it's frequent; "windows" might split in two ("window" + "s"). A rare proper noun or a technical word can shatter into four or five pieces.
Why does this matter to you? Because everything in AI tools is measured in tokens: the bill if you pay by usage, the maximum length of a conversation, the size of documents you can analyze. When a provider advertises "200,000 tokens of context," that's roughly 500 pages of book. When you paste a document in, it gets sliced into tokens before the model looks at it.
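To see this in action, here is a minimal sketch in Python using tiktoken, the open-source tokenizer used by several OpenAI models. Claude, Gemini and the others each use their own tokenizer, so the exact splits and counts will differ; the comments are indicative, not guaranteed.

import tiktoken

# Load one specific tokenizer; other models split text differently.
enc = tiktoken.get_encoding("cl100k_base")

text = "Hello, how are you?"
tokens = enc.encode(text)

print(len(tokens))                        # a handful of tokens, not whole words
print([enc.decode([t]) for t in tokens])  # fragments like ['Hello', ',', ' how', ...]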
One operation, repeated thousands of times: predict the next token.
Here's the most counterintuitive idea in the field, and the one that changes everything: however sophisticated it gets, a large language model fundamentally does only one thing. Given a sequence of tokens, predict the one that comes next.
No global planning. No upfront thinking about the whole answer. No hidden plan. One token at a time, in a loop that only stops when the model decides it's done.
How does it do it? The architecture that performs this prediction is called a transformer. What you need to remember, without diving into the machinery, is its central principle — attention. For each token to produce, the model weighs the relative importance of every token already there. Each word looks at all the others and decides which ones matter. A kind of full re-read at every step.
This mechanic has a striking practical consequence. When the AI replies to you, it doesn't know, at the moment it writes the first word, how it will end its sentence. It writes, word after word, re-reading itself at every step to decide the next one. What looks like fluid thought is a string of probabilistic micro-decisions. That's why an AI can start a confident answer and end up with a false claim — it "drifted along" with its own generation.
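The loop itself is short enough to sketch. In the Python below, predict_next_token is a hypothetical stand-in for the whole transformer; the loop around it is the point: one token per iteration, a full re-read of everything so far, and a stop only when a special end token comes out.

def generate(prompt_tokens, max_new_tokens=500, end_token=0):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Hypothetical call: the transformer weighs every token already
        # present (attention) and returns the next one.
        next_token = predict_next_token(tokens)
        tokens.append(next_token)
        if next_token == end_token:  # the model "decides it's done"
            break
    return tokens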
The context window, or why your AI "forgets."
If the model only predicts the next token from what came before, it needs a horizon — the number of tokens it can "see" at once. That horizon is the context window.
This is the central notion to internalize if you use AI tools regularly. The window is everything at once: the model's zone of attention, its field of view, and its only support for information. Anything inside it can shape the response; anything outside doesn't exist for it.
This window has a maximum size, fixed when the model is built, measured in tokens. Depending on the model, you're talking a few thousand to several hundred thousand tokens. For current Claude models, for instance, the standard window is around 200,000 tokens — equivalent to a 500-page book. Beyond that, you can't add anything: you have to remove existing content to make room.
Why your AI "forgets" after a while
You've probably had this experience: in a long conversation, the assistant seems to forget what you told it earlier. Not a bug. A direct consequence of what we just saw. When the history hits the window's limit, the application driving the model has to cut: either it prunes old messages, or it replaces them with a shorter summary. Either way, the original detail is lost to the model.
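What the application does when the limit is reached can be sketched in a few lines. This is one naive strategy among several (real products mix pruning, summarizing, and caching), and count_tokens is a hypothetical stand-in for the provider's token counter:

def prune_history(messages, max_tokens, count_tokens):
    # Keep the system prompt (first message), drop the oldest turns
    # until what remains fits in the window.
    system, rest = messages[0], messages[1:]
    while rest and count_tokens([system] + rest) > max_tokens:
        rest.pop(0)  # whatever was in this message is now lost to the model
    return [system] + rest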
For the same reason, an 800-page PDF may simply not fit into a 200,000-token window. Past that, the tool has to get clever — chunk the document, load only relevant excerpts, or refuse. No magic.
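"Getting clever" usually means some variant of the sketch below: cut the document into chunks, keep only the chunks most relevant to the question, and stop once the token budget is spent. Real tools score relevance with embeddings rather than this naive word overlap, and the per-page token figure is a rough assumption, but the budgeting logic is the same.

def select_chunks(pages, question, budget_tokens=150_000, tokens_per_page=400):
    # Naive relevance score: how many words of the question appear in the page.
    words = set(question.lower().split())
    ranked = sorted(pages, key=lambda p: -len(words & set(p.lower().split())))

    selected, used = [], 0
    for page in ranked:
        if used + tokens_per_page > budget_tokens:
            break  # budget spent; everything else stays outside the window
        selected.append(page)
        used += tokens_per_page
    return selected

# An 800-page PDF at roughly 400 tokens per page is about 320,000 tokens:
# it cannot go in whole, so some selection is unavoidable.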
From a text predictor to an assistant that answers.
Here's the second counterintuitive idea in the field. A transformer, left to itself, doesn't "answer" questions. It continues text. Give it "The capital of France is" and it will most likely complete with "Paris." Give it "Hello, how are you?" and it could just as easily continue with "asked Mary as she opened the door." — because that, too, is a plausible continuation in the corpus of texts it was trained on.
For it to behave like an assistant — to answer instead of continue — you have to give it a frame. That frame is called the system prompt: a text placed before the conversation that defines who the model is, what it should do, how it should express itself, and what it should refuse. It's the first thing it "reads" on every request, and everything else is built on top of it.
The system prompt is the mechanism that powers the products you use — ChatGPT, Claude.ai, Gemini, Copilot, Le Chat. Behind every conversational interface, an instruction text turns a raw predictor into an assistant. Here's what that looks like in practice, with the model framed by its system prompt as a SQL assistant:
User: How do I find all customers who ordered in March?
Assistant: SELECT * FROM customers c JOIN orders o ON c.id = o.customer_id WHERE EXTRACT(MONTH FROM o.date) = 3; — the join brings customers together with their orders, the EXTRACT filter isolates the ones from March.
Change the system prompt and you change the assistant. Here's the same model with a different identity:
User: My order #4521 still hasn't arrived.
Assistant: Hello, I'm sorry for the delay. Could you confirm your postal code so I can check the status of order #4521?
Same brain, two personalities. Everything is continuous text — the User: and Assistant: labels are nothing more than textual markers that help the model know when its turn to continue arrives.
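In API terms, the two assistants above are the same request with a different first message. The sketch below uses the shape of a typical chat payload (field names and prompt wording are illustrative and vary from provider to provider); only the system entry changes, and with it the personality of the reply.

sql_assistant = [
    {"role": "system", "content": "You are a SQL assistant. Reply with a query, then a one-line explanation."},
    {"role": "user", "content": "How do I find all customers who ordered in March?"},
]

support_agent = [
    {"role": "system", "content": "You are a customer-support agent for an online shop. Be warm, apologize for delays, ask only for what you need."},
    {"role": "user", "content": "My order #4521 still hasn't arrived."},
]

# Same model, same endpoint, two different assistants.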
The model remembers nothing.
Here's the third idea to internalize, and it has very concrete implications for you. The transformer is stateless. Between two requests, it has no memory of what was said. None. For a conversation to feel continuous, the application talking to the model has to resend the entire conversation on every turn.
When you type "And its population?" in a discussion that was about Canada, the application rebuilds the whole history behind the scenes and sends it to the model:
User: What is the capital of Canada?
Assistant: The capital of Canada is Ottawa, in Ontario.
User: And its population?
Assistant: ▮
Everything is there, in one long string. The model receives that block, sees that it ends with Assistant: followed by a cursor, and continues the text. Without that full reconstruction, it would have no idea what "its" refers to in the last question.
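On the application side, "simulating memory" is nothing more than keeping a list and resending all of it on every turn. A minimal sketch, assuming a hypothetical call_model function that takes the full message list and returns the assistant's text:

history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_message):
    # Append the new question, then send the *entire* history to the model.
    history.append({"role": "user", "content": user_message})
    reply = call_model(history)  # hypothetical: one stateless request
    history.append({"role": "assistant", "content": reply})
    return reply

chat("What is the capital of Canada?")
chat("And its population?")  # works only because the Ottawa exchange is resent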
This absence of internal memory has a very concrete consequence: each new exchange in a conversation re-pays the cost of everything that came before. The further the conversation goes, the more expensive each turn is in tokens, and the more the window fills up. That's why very long conversations end up dragging, slowing down, or restarting in a new thread.
It's also why modern products are starting to expose persistent memory features — a separate store from the conversation where the system records lasting facts about you (preferences, projects, professional context) to re-inject when relevant. It's not the model that remembers: it's the application that reminds it.
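The pattern behind those memory features is the same trick again: the application, not the model, keeps a small store of facts and re-injects them into the context on each request. A deliberately simple sketch, with every stored fact prepended to the system prompt (real products select which facts to include):

user_memory = [
    "Prefers short answers with examples.",
    "Works as a data analyst at a logistics company.",
]

def build_context(question):
    # The model remembers nothing; the application reminds it.
    facts = "\n".join("- " + fact for fact in user_memory)
    system = "You are a helpful assistant.\nKnown facts about the user:\n" + facts
    return [{"role": "system", "content": system},
            {"role": "user", "content": question}]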
How an AI can act on the world.
If an AI only predicts tokens, how can it "read a file," "search the web," or "send an email"? The answer is elegant: it still does nothing other than produce text — but that text can take the shape of an action instruction that the host program will recognize and execute on its behalf.
The trick comes down to two ingredients. First, you teach the model, in its system prompt, that it has access to tools: read a file, search the web, run code, etc. Second, the application watches what the model writes. When it produces a line that looks like a tool call — something like read_file("/data/report.txt") — the application intercepts it, actually runs the operation, and injects the result into the conversation. From the model's perspective, everything stays continuous text. From the application's perspective, it's the one doing the real work.
Here's what a full cycle looks like, in continuous text:
Action: read_file("/data/report.txt")
Observation: The quarterly report shows a 12% revenue increase, an 8% drop in infrastructure costs, and three strategic recommendations [...4,200 tokens total...]
Reply: The report shows a 12% revenue increase, an 8% drop in costs, and three strategic recommendations for next quarter.
The model requests an action. The application performs it. The result comes back into context, the model sees it as if it had always known, and continues. This is the fundamental mechanic of modern assistants — Claude reading your Google Drive, ChatGPT searching the web, GitHub Copilot editing your code. Always the same loop: the AI asks, the app executes, the result returns to context.
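That loop fits in a few lines. In this sketch, call_model and run_tool are hypothetical stand-ins for the model request and for the real file, web, or code operations; the pattern of "the AI asks, the app executes, the result returns to context" is the part to keep.

import re

def agent_loop(history):
    while True:
        output = call_model(history)  # hypothetical model request
        history.append({"role": "assistant", "content": output})

        # Does the output contain something like read_file("/data/report.txt")?
        match = re.search(r'(\w+)\("([^"]*)"\)', output)
        if not match:
            return output  # plain text, no tool call: this is the final answer

        tool_name, argument = match.group(1), match.group(2)
        result = run_tool(tool_name, argument)  # the app does the real work
        # The observation goes back into the context as plain text;
        # to the model it reads as something it has always known.
        history.append({"role": "user", "content": "Observation: " + result})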
The cost on the window
All of this leaves a trace in the window, and every trace costs tokens. Reading a fifty-page file means dropping fifty pages into the window. Doing ten web searches means adding ten pages of results. That's why modern agents — the ones that chain actions on their own — can saturate their window surprisingly fast. And it's also the main subject of the next article, for anyone who wants to go deeper.
Three ideas that explain everything.
If you leave this page with three things in mind, let them be these. First — the AI reads tokens, not words, and everything it sees has to fit in a fixed-size window. Second — it does only one operation, predict the next token, in a loop that re-reads the input at every step. Third — it has no memory between requests: the application around it simulates continuity by resending the history on every turn, and actually executes the tools the model asks for.
With those three ideas, you can explain why your assistant forgets after a while, why a long document might "not fit," why the same model behaves differently from one product to another, and why an agent that consults a lot of sources can become slow or imprecise. Everything you read about the topic afterward — RAG, MCP, compaction, sub-agents — will be variations on these same constraints.
If you build with agents, the story continues.
This article lays the foundations. If you use Claude Code, Cursor, custom agents, or you design your own tools on top of these models, the context window becomes a resource you have to manage actively: arbitrating between system prompt, tools, history, operation results, and persistent memory.
The next article explores all of that in detail — the full toolkit of agent engineering, the phenomena that degrade quality, and the practical heuristics for staying below saturation.