Stateless retrieval is fine for chat. For autonomous agents taking 100 steps with no human in the loop, it is the single biggest source of wasted output tokens, context bloat, and silent quality collapse. Here is the pattern that replaces every-call-is-fresh with retrieval-as-conversation.
The 100-step problem
A chat assistant retrieves once per user turn. The user asks, the system pulls context, the model responds, the human reads, the human types the next message. Each turn is a fresh question with a fresh retrieval. The state lives in the user's head.
An autonomous agent is a different machine. It takes 30, 60, 100, sometimes several hundred steps without anyone watching. It calls retrieval inside a loop. The loop reads the result, decides what to do, calls retrieval again. There is no human re-framing the query. There is no human noticing that the fourth retrieval returned the same six memories as the first.
The chat-era retrieval API was built for the chat-era pattern. Every call is independent. Every call returns the highest-ranked results for its query. When two calls look similar to the ranker, and similar queries from an autonomous agent often do, the second call returns the same payload as the first.
The cost of this is invisible in chat and crushing in autonomous loops.
For an agent taking N retrieval steps without supervision, stateless retrieval is not a performance issue. It is the dominant failure mode.
What stateless retrieval actually costs
Three costs compound. They are all invisible to the engineer until the agent's output starts to degrade.
Output tokens spent on bookkeeping
When the same memory appears in retrieval result 1 and retrieval result 4, the agent has to do something with it the second time. It typically explains to itself, in its own context, why it already considered this. "I have already reviewed this memory from the prior step, where I noted..." Those are output tokens billed at the full per-token rate, and every one of them is addressed to the model itself.
A research agent we instrumented spent 23% of its total output tokens on internal commentary about duplicate content across retrievals. Not on reasoning. On bookkeeping.
Context bloat
Every memory that comes back fills slots in the agent's context window. When 60% of the payload of retrieval call 4 is content already present from calls 1 through 3, the model's effective context capacity is shrinking even though its raw context capacity is not. By step 50, the agent is reasoning over a context window that is mostly repeated.
This compounds with the bookkeeping problem. The repeated memories are not just present; the agent's own self-talk about them is also taking up slots.
Silent quality collapse
When the agent keeps seeing the same data, it starts to over-weight it. The third time a memory appears, the model treats it as more important than a memory that appeared once. This is not a bug in the model. It is the model correctly responding to a signal that the retrieval surface fabricated.
None of these costs show up on a single retrieval call. They show up after step 30, after step 50, after step 100. The agent that worked beautifully on the first run begins to spend more on bookkeeping than on work, and the engineer who built it has no instrumentation that points at the cause.
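The missing instrumentation can be small. Here is a minimal sketch of an overlap tracker that records, per retrieval call, what fraction of the payload the agent has already been shown. The `OverlapTracker` name and the assumption that every returned item carries a stable string ID are mine, not part of any particular retrieval API.

```python
from dataclasses import dataclass, field


@dataclass
class OverlapTracker:
    """Tracks how much of each retrieval payload the agent has already seen."""

    seen: set[str] = field(default_factory=set)
    per_call_overlap: list[float] = field(default_factory=list)

    def record(self, item_ids: list[str]) -> float:
        """Record one retrieval call and return its repeat fraction (0.0 to 1.0)."""
        if not item_ids:
            self.per_call_overlap.append(0.0)
            return 0.0
        # Count items that were delivered by any earlier call in this run.
        repeats = sum(1 for item_id in item_ids if item_id in self.seen)
        overlap = repeats / len(item_ids)
        self.seen.update(item_ids)
        self.per_call_overlap.append(overlap)
        return overlap


tracker = OverlapTracker()
first = tracker.record(["m1", "m2", "m3", "m4"])   # nothing seen yet: 0.0
second = tracker.record(["m1", "m2", "m3", "m5"])  # three of four repeat: 0.75
```

Logging `per_call_overlap` alongside the step number is usually enough to show whether the repeat fraction climbs as the run gets longer.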
What session coverage does
The fix is not better ranking. The fix is a different shape of API.
When the agent's first retrieve call anchors on a record (a contact, a company, a deal, a project), the system records what was delivered: which memory IDs, which document IDs, which graph edges, which property snapshots. This goes into a session, anchored to the record, with a short TTL.
On the next call, the agent says "expand from this session." The system reads the previous coverage. It re-runs retrieval against the same sources, but it skips the items already delivered. The agent receives the next page automatically, deduped against everything it has already seen.
// First call: anchor on record, get the first page
{
  "mode": "scout",
  "record": { "email": "jane@acme.com" }
}
// Response includes:
//   state.sessionId = "sess_abc"
//   state.coverage = { memories: 25, documents: 5, graph: 8 }
//   state.exhausted = { memories: false, documents: false, graph: false }
// Later in the agent loop: "give me more"
{
  "mode": "expand",
  "continueFrom": "sess_abc"
}
// Response: next page from each non-exhausted source.
// Zero overlap with the first page. No client-side dedup required.
The session is anchored to the record, not to the query string. This matters. The agent in step 12 might ask a slightly different question than it asked in step 5. The retrieval surface still knows what it has already shown the agent about that record, and it does not re-show it.
When all sources for a record are exhausted, state.exhausted reports it source by source. The agent knows there is nothing more to fetch, and it can move on instead of looping. An agent that can detect "I have already learned everything available about this entity" stops asking the question. That single signal saves more tokens than any ranking improvement.
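To make the mechanism concrete, here is a rough server-side sketch of a session ID plus a coverage map. Everything in it is illustrative: the in-memory `SOURCES` table, the `PAGE_SIZE` of 2, the `items` key in the response, and the `scout` / `expand` function names stand in for whatever your retrieval service actually does, and the short TTL on sessions is left out for brevity.

```python
import itertools
from dataclasses import dataclass, field

PAGE_SIZE = 2  # hypothetical page size, kept tiny for the example

# Hypothetical ranked results per record anchor; a real system would query its stores.
SOURCES = {
    "jane@acme.com": {
        "memories": ["mem_1", "mem_2", "mem_3"],
        "documents": ["doc_1"],
    }
}


@dataclass
class Session:
    anchor: str
    # Coverage map: source name -> IDs already delivered in this session.
    delivered: dict[str, set[str]] = field(default_factory=dict)


_sessions: dict[str, Session] = {}
_ids = itertools.count(1)


def _next_page(session: Session) -> dict:
    """Return the next undelivered page from every source and update coverage."""
    items, coverage, exhausted = {}, {}, {}
    for source, ranked in SOURCES[session.anchor].items():
        seen = session.delivered.setdefault(source, set())
        fresh = [item for item in ranked if item not in seen][:PAGE_SIZE]
        seen.update(fresh)
        items[source] = fresh
        coverage[source] = len(seen)
        exhausted[source] = len(seen) >= len(ranked)
    return {"items": items, "coverage": coverage, "exhausted": exhausted}


def scout(email: str) -> dict:
    """First call: open a session anchored on the record."""
    session_id = f"sess_{next(_ids)}"
    session = Session(anchor=email)
    _sessions[session_id] = session
    return {"sessionId": session_id, **_next_page(session)}


def expand(session_id: str) -> dict:
    """Later calls: continue from the session, skipping what was delivered."""
    return {"sessionId": session_id, **_next_page(_sessions[session_id])}


first = scout("jane@acme.com")
second = expand(first["sessionId"])
```

In this toy run the first page carries two memories and the only document, so `documents` is already reported as exhausted. The second page carries the one remaining memory and nothing else, at which point every source reports exhausted and the agent has its signal to move on.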
The token math, concretely
Take a research agent doing 30 retrieval calls against the same contact during one autonomous run. Assume each call returns 8 items, and the agent spends roughly 50 output tokens internally processing each item it receives.
Without session coverage, the ranker returns broadly the same top items each time. Past step 5, roughly 60% of each payload is content the agent has seen. The 30 calls collectively produce 240 items. Of those, around 96 are unique. The remaining 144 are repeats the agent has to internally process and dismiss.
144 duplicates × 50 output tokens each = 7,200 wasted output tokens per agent run, every run.
At current Opus output rates, that is roughly eleven cents per run spent on bookkeeping the agent did not need to do. Run that agent 50,000 times in a month and you are paying $5,500 to repeatedly process content the system already delivered.
With session coverage, the 30 calls return 96 items total, all unique. No internal commentary about duplicates. No context-window bloat from repeats. The agent's effective reasoning capacity per step is what the model's context budget actually offers, not 40% of it.
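The arithmetic is easy to rerun with your own numbers. The sketch below uses the figures from this section. The one value that is not stated above is the per-token price: $15 per million output tokens is an assumption back-derived from the "eleven cents" figure, not a quoted rate, so substitute whatever your provider charges.

```python
CALLS = 30
ITEMS_PER_CALL = 8
REPEAT_FRACTION = 0.60   # share of delivered items the agent has already seen
TOKENS_PER_ITEM = 50     # output tokens spent processing each item
RUNS_PER_MONTH = 50_000

# Assumption: back-derived from the "eleven cents per run" figure, not a quoted price.
USD_PER_MILLION_OUTPUT_TOKENS = 15.0

total_items = CALLS * ITEMS_PER_CALL                    # 240
duplicate_items = round(total_items * REPEAT_FRACTION)  # 144
unique_items = total_items - duplicate_items            # 96
wasted_tokens = duplicate_items * TOKENS_PER_ITEM       # 7,200

# About $0.108 per run at the assumed rate, which rounds to eleven cents.
cost_per_run = wasted_tokens * USD_PER_MILLION_OUTPUT_TOKENS / 1_000_000
# About $5,400 per month unrounded; rounding to eleven cents first gives $5,500.
cost_per_month = cost_per_run * RUNS_PER_MONTH
```

Doubling `CALLS` or `TOKENS_PER_ITEM` doubles the waste, which is the sense in which the math gets worse as agents run longer.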
The dollar number is real but it is not the headline. The headline is that the agent reasons better when its context window is not full of its own bookkeeping.
Why "better ranking" does not solve this
The instinct, when an engineer first sees the duplicate problem, is to improve the ranker. Add diversity penalties. Add recency boosts. Tune MMR. None of these solve the problem because they cannot solve it: the ranker has no idea what it has already returned to this agent in this run.
A diversity penalty applied within a single call cannot dedupe across calls. A recency boost cannot tell that the agent already received the recent items in step 1. The information the ranker needs to act on is held outside the call. Coverage is that information, made explicit.
This is also why client-side dedup is not enough. If the agent maintains a "seen memory IDs" set and filters the next response client-side, the call still transfers the full payload, still consumes tokens as that payload is parsed into the agent's context, and the dropped items still cost CPU and network. Coverage at the API level skips the work entirely, server-side, before the bytes leave the box.
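A toy comparison shows the difference. This sketch assumes the simplest possible stateless ranker, one that returns the same top page for every call. The `RANKED` list, the `PAGE` size, and both function names are made up for the illustration. It counts how many items cross the wire and how many of them are actually new to the agent.

```python
RANKED = [f"mem_{i}" for i in range(1, 13)]  # hypothetical ranked memories for one record
PAGE = 4


def stateless_retrieve() -> list[str]:
    """Chat-style call: always the same top-ranked page."""
    return RANKED[:PAGE]


def client_side_dedup(calls: int) -> tuple[int, int]:
    """Return (items transferred, new items kept) when the agent filters locally."""
    seen: set[str] = set()
    transferred = kept = 0
    for _ in range(calls):
        payload = stateless_retrieve()
        transferred += len(payload)  # the full payload still travels
        fresh = [m for m in payload if m not in seen]
        seen.update(fresh)
        kept += len(fresh)
    return transferred, kept


def server_side_coverage(calls: int) -> tuple[int, int]:
    """Return (items transferred, new items kept) when the server skips delivered items."""
    delivered: set[str] = set()
    transferred = kept = 0
    for _ in range(calls):
        payload = [m for m in RANKED if m not in delivered][:PAGE]
        delivered.update(payload)
        transferred += len(payload)
        kept += len(payload)  # everything that arrives is new
    return transferred, kept


client = client_side_dedup(3)     # (12, 4)
server = server_side_coverage(3)  # (12, 12)
```

Under these assumptions both approaches move twelve items in three calls, but the client-side filter keeps only the first four and never gets past page one, while the server-side version delivers twelve distinct items. Real rankers vary more between calls than this one does, so treat the numbers as the worst case rather than a measurement.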
When to expand, when to reset
Two operational rules cover most cases.
Expand when the agent stays on the same entity. If your agent is working through what it knows about Jane at Acme, every subsequent retrieval call should continue from the previous session. The system will paginate cleanly until exhaustion.
Reset when the agent switches focus. When the agent moves from Jane at Acme to the renewal process for the broader account, that is a new question with a new anchor. Start a new scout call. The previous session was anchored to Jane and is no longer relevant.
Implementations that get this wrong typically default to "always reset" (the chat pattern applied uniformly) or "always expand" (the agent tries to continue a session that no longer matches the question). Neither is right. The agent code has to know which mode of conversation it is in.
The good news is that for most autonomous loops, the answer is obvious from the loop variable. A loop over entities resets the session every iteration. A loop within an entity expands. The decision is structural, not semantic.
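In code, the structural rule looks roughly like this. The request payloads mirror the scout and expand examples earlier in the piece. The `items` key in the response, the `max_pages` guard, and the `make_fake_retrieve` stub (with its `raj@acme.com` test record) are assumptions added so the sketch runs on its own.

```python
from typing import Callable

Retrieve = Callable[[dict], dict]


def research_entities(emails: list[str], retrieve: Retrieve, max_pages: int = 10) -> dict[str, list[str]]:
    findings: dict[str, list[str]] = {}
    for email in emails:
        # Loop over entities: reset by starting a new scout call.
        response = retrieve({"mode": "scout", "record": {"email": email}})
        items = list(response["items"])
        pages = 1
        while not all(response["state"]["exhausted"].values()) and pages < max_pages:
            # Loop within an entity: expand from the existing session.
            session_id = response["state"]["sessionId"]
            response = retrieve({"mode": "expand", "continueFrom": session_id})
            items.extend(response["items"])
            pages += 1
        findings[email] = items
    return findings


def make_fake_retrieve(data: dict[str, list[str]], page: int = 2) -> Retrieve:
    """Hypothetical stand-in for the retrieval service, paginating one source."""
    sessions: dict[str, tuple[str, int]] = {}  # sessionId -> (email, offset)

    def retrieve(request: dict) -> dict:
        if request["mode"] == "scout":
            email = request["record"]["email"]
            session_id = f"sess_{len(sessions) + 1}"
            offset = 0
        else:
            session_id = request["continueFrom"]
            email, offset = sessions[session_id]
        items = data[email][offset:offset + page]
        offset += len(items)
        sessions[session_id] = (email, offset)
        exhausted = {"memories": offset >= len(data[email])}
        return {"items": items, "state": {"sessionId": session_id, "exhausted": exhausted}}

    return retrieve


retrieve = make_fake_retrieve({"jane@acme.com": ["m1", "m2", "m3"], "raj@acme.com": ["m4"]})
findings = research_entities(["jane@acme.com", "raj@acme.com"], retrieve)
```

The outer loop never passes a session ID across entities and the inner loop never starts a fresh scout, which is the whole rule.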
What it really buys you
A retrieval surface that knows what it has already shown the agent is not a feature. It is the precondition for agents that run unsupervised for 100+ steps. Without it, the agent spends more on remembering than on thinking, and the failure mode is invisible because every individual call returns a perfectly valid result.
The frame I have started using when reviewing agent designs is this. Look at the retrieval calls in the autonomous loop. If two of them could plausibly return overlapping payloads, and the agent has no way to tell the system "skip what you have already shown me," you have a token budget problem that gets worse every time the agent gets smarter.
Retrieval is a conversation when the agent is the one having it. The chat-era assumption that every call is fresh dies the first time you put an agent in a loop with no human in it.
The mechanism is simple. Anchor on the entity. Track what was delivered. Skip it on the next call. Report when the source is exhausted. The implementation is a session ID and a coverage map. The payoff is autonomous agents that can run for hundreds of steps without their own context becoming their largest cost center.
If you are designing an agent that calls retrieval more than a handful of times per record, ask which of two costs you would rather pay: the engineering cost of session-aware retrieval, or the token cost of a model that re-reads itself for the next thousand runs. The math gets worse as the agent gets longer-running, and the agents are getting longer-running.
Companion pieces: "Instructions, Not Prompts" on the per-minute economics of instruction-chain agents.