Every output token your agent produces is billed at the premium rate of the strongest model in the loop. When the agent's job is to compress raw retrieved items into a paragraph, you are paying premium rates for compression. Here is how intent-based retrieval moves the synthesis to where the data lives, and what that does to the token bill.
The expensive agent's most expensive activity
Premium agents use premium models. Opus, GPT-5, the top of whatever family you are paying for. The reason you are paying for the top of the family is judgment: planning the next step, deciding what to do, knowing when to abort. That judgment is what the expensive agent is uniquely good at.
The expensive agent's most expensive activity, in practice, is not judgment. It is compressing raw retrieved data into a paragraph. The agent reads 20 memories, two documents, and three property values. Then it writes a one-paragraph synthesis. That paragraph costs premium output tokens. The compression is data work, not strategy, not judgment, not anything the expensive model is uniquely good at.
This is what most production agent code looks like today, and it is the reason "my agent costs are out of control" is almost always a retrieval architecture problem disguised as a model problem.
The expensive agent does not get cheaper by becoming smarter. It gets cheaper by doing less work that does not need its intelligence.
Two shapes of retrieval call
There are two ways an agent can talk to a retrieval surface. The shape of the call decides who does the synthesis work.
Shape one is a query. "Give me data; I will figure out what to say with it."
The agent calls retrieval, gets back raw items, stuffs them into its own context, reasons over them in its own LLM at its own rate, and writes the answer. The retrieval surface is a lookup. The agent is the compressor.
Shape two is an intent. "Answer this for me, from what you know."
The agent passes the question. The retrieval surface pulls the four sources, runs the synthesis through its own internal LLM (which can be a cheaper tier because compaction is not strategy), and returns an answer with citations. The agent is the consumer of a synthesized result.
The naming is doing real work. A query implies "execute this lookup and give me the data." An intent implies "achieve this outcome and tell me what's true." Under a query, the retrieval surface hands the synthesis back to the agent. Under an intent, it takes the work on itself.
For an expensive agent, the choice between these shapes has a direct dollar cost per call, and a compounding context cost across an autonomous loop.
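To make the difference concrete, here is a minimal Python sketch of the two shapes. Everything in it is a stand-in invented for illustration: `Ledger`, `count_tokens`, `summarize`, `query_shape`, and `intent_shape` are hypothetical names, the tokenizer is a whitespace split, and the "synthesis" is a trivial truncation rather than an LLM call. The only thing it shows is which tier's counters move in each shape.

```python
from dataclasses import dataclass


@dataclass
class Ledger:
    """Counts tokens by the tier that paid for them."""
    premium_in: int = 0
    premium_out: int = 0
    cheap_in: int = 0
    cheap_out: int = 0


def count_tokens(text: str) -> int:
    # Assumption: a crude whitespace tokenizer, good enough for a sketch.
    return len(text.split())


def summarize(items: list[str]) -> str:
    # Stand-in for an LLM synthesis call: keep the first three words of each item.
    return " ".join(" ".join(item.split()[:3]) for item in items)


def query_shape(items: list[str], ledger: Ledger) -> str:
    """Shape one: retrieval is a lookup, the agent compresses at the premium rate."""
    ledger.premium_in += sum(count_tokens(i) for i in items)
    answer = summarize(items)
    ledger.premium_out += count_tokens(answer)
    return answer


def intent_shape(items: list[str], ledger: Ledger) -> str:
    """Shape two: retrieval synthesizes at the cheap rate, the agent reads the result."""
    ledger.cheap_in += sum(count_tokens(i) for i in items)
    answer = summarize(items)
    ledger.cheap_out += count_tokens(answer)
    ledger.premium_in += count_tokens(answer)  # the agent only ingests the answer
    return answer
```

Both functions return the same answer. What changes is the ledger: in the query shape every raw token lands on the premium counters, while in the intent shape the premium tier sees only the synthesized result.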
What delegation actually looks like
The intent shape, in the unified retrieve API, is brief mode. The agent passes a record and a question:
{
  "mode": "brief",
  "record": { "email": "jane@acme.com" },
  "message": "what's the stage and any objections to watch?",
  "qualityTier": "pro"
}
What happens inside the retrieval system:
- Identity resolution finds the record.
- The four sources (properties, atomic memories, documents, and graph edges; see One Call, Four Sources) get pulled in parallel.
- An internal LLM (typically a cheaper tier than the calling agent's model) composes a context block, attaches citation markers, and writes a grounded answer.
- The response is the answer plus the citations plus a sources-used list.
{
  "answer": "Jane is qualified at $1.5M ARR. The most recent objection [M3] is around vendor consolidation, raised on the October call. [M1] confirms she is the technical buyer. [D2] is the playbook we are following on consolidation conversations.",
  "answerConfidence": "high",
  "sourcesUsed": ["M1", "M3", "D2"],
  "usage": {
    "totalTokens": 3850,
    "synthesisDurationMs": 1820,
    "costUSD": 0.0145
  }
}
The agent receives a paragraph with citations. The agent did not have to read 20 memories. The agent did not have to write the compression. The synthesis happened inside the retrieval system, at a cheaper tier, with the right context window for the job.
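The steps inside the retrieval system can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: the source fetchers are in-memory stubs, `resolve_identity`, `pull_sources`, and `build_context` are hypothetical names, and the `P` and `G` marker prefixes are guesses by analogy with the `M` and `D` markers in the response above. The internal LLM call is left out. The sketch stops at the cited context block that such a call would receive.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the four sources, keyed by marker prefix.
SOURCES = {
    "P": lambda record_id: ["stage: qualified"],
    "M": lambda record_id: ["Jane is the technical buyer",
                            "Objection: vendor consolidation"],
    "D": lambda record_id: ["Consolidation playbook"],
    "G": lambda record_id: ["jane -> works_at -> acme"],
}


def resolve_identity(record: dict) -> str:
    # Assumption: the email address is the lookup key.
    return record["email"].lower()


def pull_sources(record_id: str) -> dict[str, list[str]]:
    """Fetch all four sources in parallel and return them keyed by prefix."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        futures = {prefix: pool.submit(fetch, record_id)
                   for prefix, fetch in SOURCES.items()}
        return {prefix: future.result() for prefix, future in futures.items()}


def build_context(pulled: dict[str, list[str]]) -> str:
    """Number each item per source so the synthesis step can cite it."""
    lines = []
    for prefix, items in pulled.items():
        for n, text in enumerate(items, start=1):
            lines.append(f"[{prefix}{n}] {text}")
    return "\n".join(lines)


record_id = resolve_identity({"email": "Jane@acme.com"})
context = build_context(pull_sources(record_id))
```

The context block is what makes the citations checkable: every marker the internal model emits maps back to exactly one retrieved item.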
If the agent needs to verify a claim, it can take the cited ID, such as [M3], and call fetch mode to read that exact memory in detail. The full data is one cheap lookup away, but the agent only pays for it when it actually needs to look.
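Before escalating, the agent can check cheaply that the citations in the answer and the sources-used list agree. The sketch below assumes markers look like `[M3]` (one or more capital letters followed by digits), as in the example response. `cited_ids` and `check_citations` are hypothetical helpers, and the fetch request itself is not shown because its payload is not specified here.

```python
import re

# Assumption: citation markers are capital letters plus digits in square brackets.
CITATION = re.compile(r"\[([A-Z]+\d+)\]")


def cited_ids(answer: str) -> list[str]:
    """Return citation IDs in order of first appearance, without duplicates."""
    seen: list[str] = []
    for marker in CITATION.findall(answer):
        if marker not in seen:
            seen.append(marker)
    return seen


def check_citations(response: dict) -> dict[str, list[str]]:
    """Compare markers found in the answer text with the sourcesUsed list."""
    in_text = set(cited_ids(response["answer"]))
    listed = set(response["sourcesUsed"])
    return {
        "unlisted": sorted(in_text - listed),  # cited in text, missing from list
        "uncited": sorted(listed - in_text),   # listed, never cited in text
    }


response = {
    "answer": "Jane is qualified at $1.5M ARR. The most recent objection [M3] "
              "is around vendor consolidation, raised on the October call. "
              "[M1] confirms she is the technical buyer. "
              "[D2] is the playbook we are following on consolidation conversations.",
    "sourcesUsed": ["M1", "M3", "D2"],
}
report = check_citations(response)
```

If both lists in the report are empty, any ID from `cited_ids` is a safe candidate for a follow-up fetch.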
The token math, concretely
Take the question above and run it through both patterns.
Pattern A — query (agent does the synthesis).
- Retrieval returns 20 atomic memories + 4 document chunks ≈ 3,000 tokens of raw context.
- Agent ingests 3,000 input tokens at the premium model's input rate.
- Agent writes a 150-token synthesized paragraph at the premium model's output rate.
- Wall-clock: ~200ms retrieval + ~2,000ms agent LLM call ≈ 2.2 seconds.
Pattern B — intent (retrieval does the synthesis).
- Retrieval pulls the same 3,000 tokens internally.
- An internal LLM (cheap tier, e.g. a mini/haiku-class model) ingests those 3,000 tokens at the cheap rate and writes a 150-token answer at the cheap rate.
- Agent ingests a 250-token response (answer + citations + sources list) at the premium rate.
- Agent writes ~50 output tokens passing the result along.
- Wall-clock: one round-trip including internal LLM ≈ 1.5 seconds.
The premium token cost of Pattern B is roughly the cost of ingesting 250 tokens and writing 50. Pattern A's premium token cost is ingesting 3,000 tokens and writing 150. The ratio depends on the model's input-to-output rate spread. For the common case (premium input priced about 5x below premium output, cheap tier 10-20x cheaper still), Pattern B's premium-tier cost is about 13% of Pattern A's: 250 + 5 × 50 = 500 input-token equivalents against 3,000 + 5 × 150 = 3,750. Adding the cheap-tier synthesis brings the total to roughly 18-23% of Pattern A's cost per call.
That number does not include the latency win. Internal synthesis happens in the same hop as the data fetch. The agent is not making a separate LLM call to compress the data it just received. One round-trip, one wait, one usable result.
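The per-call arithmetic can be checked with a few lines of Python. The token counts come from the two patterns above. The rate assumptions are the ones named in the text (premium output about 5x premium input, cheap tier 10-20x cheaper), plus one extra assumption of this sketch: that the cheap tier has the same 5x output-to-input spread. Costs are expressed in units of one premium input token, so no real price list is needed. The function names are illustrative.

```python
def premium_units(input_tokens: int, output_tokens: int,
                  out_to_in: float = 5.0) -> float:
    """Cost in units of one premium input token."""
    return input_tokens + output_tokens * out_to_in


def cheap_units(input_tokens: int, output_tokens: int, discount: float,
                out_to_in: float = 5.0) -> float:
    """Cheap-tier cost in premium input-token units (discount=10 means 10x cheaper)."""
    return premium_units(input_tokens, output_tokens, out_to_in) / discount


# Pattern A: the agent ingests 3,000 raw tokens and writes 150.
pattern_a = premium_units(3000, 150)           # 3750.0

# Pattern B, premium side: the agent ingests 250 tokens and writes 50.
pattern_b_premium = premium_units(250, 50)     # 500.0

# Pattern B, cheap side: the internal LLM ingests 3,000 and writes 150.
pattern_b_cheap_10x = cheap_units(3000, 150, discount=10)   # 375.0
pattern_b_cheap_20x = cheap_units(3000, 150, discount=20)   # 187.5

premium_ratio = pattern_b_premium / pattern_a                           # ~0.133
total_ratio_10x = (pattern_b_premium + pattern_b_cheap_10x) / pattern_a  # ~0.233
total_ratio_20x = (pattern_b_premium + pattern_b_cheap_20x) / pattern_a  # ~0.183
```

Swapping in your own token counts and rate spread changes the ratios. The structure of the calculation stays the same.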
Why this compounds for autonomous agents
A single call's saving is interesting. The compounding across an autonomous loop is the actual story.
Suppose an autonomous agent runs 50 steps and calls retrieval at 30 of them. With Pattern A, each call dumps 3K tokens into the agent's context, and the agent's reply consumes 150 output tokens of premium-rate compression. Across 30 calls:
- 90K tokens added to agent context. The agent's effective reasoning capacity is now competing with 90K tokens of retrieved data that the model has to attend to on every subsequent decision.
- 4,500 output tokens of premium-rate compression work.
- 30 retrieval calls × 2.2 seconds = 66 seconds of latency, sequential.
With Pattern B, each call adds 250 tokens to the agent's context and ~50 output tokens of passing-through. Across the same 30 calls:
- 7.5K tokens added to agent context. The agent reasons over an order of magnitude less data, all of it already synthesized.
- 1,500 output tokens of premium-rate passing-along.
- 30 retrieval calls × 1.5 seconds = 45 seconds.
The expensive model's context capacity is the constraint that limits how long the agent can run. Pattern B buys back roughly 80K tokens of that capacity across those 30 retrieval calls. That is the difference between an agent that runs to completion and an agent that hits its context limit at step 38 and starts truncating its own working memory.
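The loop totals are plain multiplication, sketched here so the per-call figures can be swapped for your own. `Pattern` and `loop_totals` are illustrative names. The per-call numbers are the ones used in this section, with latency kept in whole milliseconds to avoid floating-point noise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    context_tokens_per_call: int   # tokens added to the agent's context
    output_tokens_per_call: int    # premium output tokens the agent writes
    latency_ms_per_call: int       # wall-clock per retrieval call


def loop_totals(pattern: Pattern, calls: int) -> dict[str, int]:
    """Accumulate per-call figures over a number of sequential retrieval calls."""
    return {
        "context_tokens": pattern.context_tokens_per_call * calls,
        "output_tokens": pattern.output_tokens_per_call * calls,
        "latency_ms": pattern.latency_ms_per_call * calls,
    }


QUERY = Pattern(context_tokens_per_call=3000, output_tokens_per_call=150,
                latency_ms_per_call=2200)
INTENT = Pattern(context_tokens_per_call=250, output_tokens_per_call=50,
                 latency_ms_per_call=1500)

totals_a = loop_totals(QUERY, calls=30)
totals_b = loop_totals(INTENT, calls=30)

# Context capacity Pattern B leaves free for reasoning: 90,000 - 7,500.
context_saved = totals_a["context_tokens"] - totals_b["context_tokens"]
```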
Intent-based retrieval is not a cost optimization for individual calls. It is a context-budget optimization that lets autonomous agents run longer before they exhaust themselves.
And it does not re-deliver what you already have
Pattern B composes cleanly with the session coverage pattern from Retrieval Is a Conversation. When the agent says "give me more on this record," expand skips what was already delivered. When the agent says "answer this new question on the same record," brief synthesizes from what is now relevant without re-shipping previously seen items in raw form.
A loop that combines both looks like this: the first call is brief, the agent gets a synthesized answer; the agent decides it needs more detail on a specific point; it calls fetch for that one memory ID; the next "answer me this" question goes through brief again, with coverage carried in the session, so the synthesis is constructed from items the agent has not seen yet.
The agent at no point holds raw retrieved data in its premium context except the specific items it explicitly chose to inspect. Everything else is one synthesized paragraph at a time, with the IDs to escalate if needed.
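One way to picture that loop is a session object that remembers which item IDs it has already delivered. The sketch below is a guess at the idea, not the real session mechanism: `Session`, its `brief` and `fetch` methods, and the in-memory `store` are all hypothetical, and the "synthesis" is simple concatenation standing in for the internal LLM.

```python
class Session:
    """Toy retrieval session that tracks coverage across calls."""

    def __init__(self, store: dict[str, str]):
        self.store = store                 # item ID -> raw text
        self.delivered: set[str] = set()   # IDs the agent has already been given

    def brief(self, relevant_ids: list[str]) -> dict:
        """Synthesize only from relevant items the agent has not seen yet."""
        fresh = [i for i in relevant_ids if i not in self.delivered]
        self.delivered.update(fresh)
        answer = " ".join(f"{self.store[i]} [{i}]" for i in fresh)
        return {"answer": answer, "sourcesUsed": fresh}

    def fetch(self, item_id: str) -> str:
        """Return one raw item on request and mark it as covered."""
        self.delivered.add(item_id)
        return self.store[item_id]


store = {
    "M1": "Jane is the technical buyer.",
    "M3": "Objection: vendor consolidation.",
    "D2": "Consolidation playbook.",
}
session = Session(store)

first = session.brief(["M1", "M3"])    # synthesized answer, two sources
detail = session.fetch("M3")           # the one item the agent chose to inspect
second = session.brief(["M3", "D2"])   # M3 is skipped, only D2 is new
```

The only raw text that reaches the agent is `detail`, the single item it explicitly asked for.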
When to use which
Both shapes are right. The discipline is picking the right one for the call.
Use brief (intent) when:
- The agent's next action is to reason from an answer, not from raw data.
- The downstream consumer of the agent's output is a human or another system that expects prose.
- The cost of having the premium agent compress raw items is greater than the cost of the internal synthesis. For most production questions on records, it is.
Use scout (query) when:
- The agent's logic depends on iterating over individual items (classifying each one, scoring each one, branching per item).
- The agent will join the retrieved data with non-record-scoped context that the retrieval surface cannot synthesize for it.
- The agent is building a decision tree, not writing a paragraph.
For production agents asking record-level questions, brief is usually the right fit. For agents doing data manipulation, scout is. Either way, picking the mode is an engineering decision, not a convenience default: use whichever pattern matches the actual question.
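The criteria above can be written down as an explicit check, which forces the choice to be made per call rather than inherited. This is a minimal sketch: `CallProfile` and `pick_mode` are hypothetical names, and the three flags simply mirror the scout criteria listed in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallProfile:
    iterates_over_items: bool      # classify, score, or branch per item
    joins_external_context: bool   # needs data the retrieval surface cannot see
    builds_decision_tree: bool     # output is a structure, not a paragraph


def pick_mode(profile: CallProfile) -> str:
    """Return "scout" when the agent needs raw items, otherwise "brief"."""
    needs_raw_items = (
        profile.iterates_over_items
        or profile.joins_external_context
        or profile.builds_decision_tree
    )
    return "scout" if needs_raw_items else "brief"


# A record-level question whose answer will be read as prose.
record_question = CallProfile(False, False, False)

# A per-item scoring pass over retrieved memories.
scoring_pass = CallProfile(True, False, False)
```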
The principle
A query is a lookup. An intent is an outcome. When the agent expresses intent, the retrieval surface is allowed to do whatever combination of work serves it: filtering, ranking, compacting, citing, refusing. The agent is freed from being the data integration and compression layer.
If you are building autonomous agents on premium models, audit one thing first. How many output tokens does the agent spend per step on compressing raw retrieval data into prose? If the answer is "a lot," you are paying the premium model's rate for work that belongs at the retrieval layer.
Move the synthesis. Cite the sources. Keep the agent's expensive output budget for what only the agent can do, which is decide what to do next.
The retrieval system that returns a paragraph with citations is not doing less for you than the one that returns 20 memory chunks. It is doing more, in the place where doing it is cheap, so the place where doing it is expensive is freed up for the work that actually requires the expensive model.
Companion pieces: One Call, Four Sources on the unified payload that brief synthesizes over. Retrieval Is a Conversation on the session-coverage pattern that prevents brief from re-synthesizing over the same items across an autonomous loop.