Every hard problem in distributed systems eventually reduces to state. Who owns it, where it lives, how it stays consistent, what happens when it diverges. Agentic systems inherit all of these problems and add new ones, because the agent itself is stateless — it has no memory between calls — while the tasks it performs are often deeply stateful. Bridging that gap is where most of the real complexity lives.

Consider a multi-step task: the agent retrieves information, makes a decision, calls a tool, waits for a result, makes another decision. Each step depends on the results of previous ones. If the task fails halfway through — the tool times out, the context window fills, the user interrupts — you need to know what was completed, what wasn't, and whether it's safe to resume or necessary to restart. The agent can't tell you, because the agent doesn't remember. Your system has to.

The approaches are well-known from distributed systems: checkpointing state at each step, using event logs to reconstruct what happened, designing tasks to be resumable from any checkpoint. They're well-known because they're necessary — the same fundamental problem has been solved in different forms many times. The mistake is thinking that agentic systems are somehow different, that the conversational interface or the AI backbone changes the underlying state management challenge. It doesn't. The agent is just another stateless service that needs external state management to participate in stateful workflows.

What is different is that the state in agentic systems often includes things that are harder to serialize than database records. The agent's current understanding of a problem. The context it's been given. The implicit decisions it's made in the course of a long conversation. Capturing all of this in a way that lets you resume meaningfully — not just technically — requires thought about what actually needs to persist and what can be reconstructed.

The teams that handle this well design their state management before they design their agent logic. They ask: if this task is interrupted at any point, what do we need to resume it? They answer that question concretely and build the infrastructure to maintain it.

The agent forgets everything. Design as if that's a constraint, not an oversight.