Provider batch APIs cut LLM extraction cost in half. Prompt caching cuts another fifteen to twenty percent off. Combined, customers who can wait 1-24 hours pay roughly 40 percent of what they would on the sync path. Here is the architecture that makes the discount actually reach the customer, and why writes are the right place to spend the latency budget you saved at retrieval.
The headline, with the math behind it
Memorization is where the LLM cost lives. Extracting atomic facts, typed properties, and graph edges from a transcript costs more, per record, than every retrieval that record will produce for the next year. Production agent systems that look expensive in the bill are almost always paying for writes.
There are two places to push that cost down. Both have been sitting in plain sight, on every major provider's pricing page, for at least a year. We turned them both on for memorization.
The first is the provider batch API. Anthropic's Message Batches API, OpenAI's Batch API, and AWS Bedrock's CreateModelInvocationJob all offer the same deal: submit your requests in bulk, accept a delivery SLA of up to 24 hours, pay roughly half the per-token price. The discount is provider-funded, not negotiated. It is there for any caller who can wait.
The second is prompt caching. The extraction prompt is mostly static. The schema definitions, the extraction instructions, the few-shot examples, all of that is identical across thousands of memorize calls. Only the content payload changes. Bedrock prompt caching reads the static prefix once, caches it, and serves the next ten thousand calls against the cached prefix at a discount of around ninety percent on the cached portion. Net effect on the total extraction call: another fifteen to twenty percent off.
Stack them and you land at roughly 60% off the sync-path price for the same extraction quality, delivered in 1-24 hours instead of seconds.
The pricing was on the page. The architecture to actually reach it was the work.
Where the 50% comes from
Each provider exposes its batch endpoint as a separate API surface from the chat completions endpoint. The shapes differ in detail but the contract is the same: you submit a batch of requests, you get back a job ID, you poll or wait for a webhook, you read the results when the job finishes.
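The shared contract (submit, get a job ID, poll, read results) can be sketched as a small interface. This is a minimal illustration, not any provider's SDK: `BatchProvider`, `FakeProvider`, `run_batch`, and the `custom_id` field are hypothetical names, and the fake provider simply completes after a fixed number of status checks.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Protocol


class JobStatus(Enum):
    PENDING = "pending"
    COMPLETED = "completed"


class BatchProvider(Protocol):
    """Hypothetical common interface over the provider batch endpoints."""

    def submit(self, requests: list[dict]) -> str: ...
    def status(self, job_id: str) -> JobStatus: ...
    def results(self, job_id: str) -> list[dict]: ...


@dataclass
class FakeProvider:
    """In-memory stand-in: a job completes after `polls_needed` status checks."""

    polls_needed: int = 2
    _jobs: dict = field(default_factory=dict)

    def submit(self, requests: list[dict]) -> str:
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = {"requests": requests, "polls": 0}
        return job_id

    def status(self, job_id: str) -> JobStatus:
        job = self._jobs[job_id]
        job["polls"] += 1
        done = job["polls"] >= self.polls_needed
        return JobStatus.COMPLETED if done else JobStatus.PENDING

    def results(self, job_id: str) -> list[dict]:
        # Echo each request's custom_id so the caller can reconcile results.
        return [
            {"custom_id": r["custom_id"], "output": "..."}
            for r in self._jobs[job_id]["requests"]
        ]


def run_batch(provider: BatchProvider, requests: list[dict], max_polls: int = 10) -> list[dict]:
    """Submit once, then poll until the job completes or the poll budget runs out."""
    job_id = provider.submit(requests)
    for _ in range(max_polls):
        if provider.status(job_id) is JobStatus.COMPLETED:
            return provider.results(job_id)
    raise TimeoutError(f"{job_id} still pending after {max_polls} polls")
```

A real poller would wait between status checks (or react to a completion event) rather than loop immediately; the loop here only shows the shape of the contract.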
Anthropic's Message Batches API accepts up to 100,000 requests (or 256 MB) per batch with a 24-hour completion window and prices the input and output tokens at fifty percent of the sync API rate.
OpenAI's Batch API accepts up to fifty thousand requests per batch, also with a 24-hour window, also at fifty percent of the sync rate.
Bedrock's CreateModelInvocationJob runs against an S3 input file containing many requests. Pricing varies by model but the discount sits in the same neighborhood for the models we use for extraction.
The catch with all three is the SLA. "24 hours" is a ceiling, not a target. In practice we see most batches finish in 30 minutes to four hours, but the variance is wide and you cannot promise a customer their write will be done by a specific minute. That makes batch the wrong choice for inline operations where an agent is waiting on the result. It makes batch the right choice for everything else: CRM imports, scheduled backfills, daily ingest pipelines, anything where the work is queued and the user is not.
Where the additional 15-20% comes from
Prompt caching works because extraction prompts are mostly the same prompt, called over and over with different content.
A schema-guided extraction call looks something like this. The system prompt contains the org's collection definitions, the property schemas with their type definitions and extraction instructions, the few-shot examples, the output format rules. That block is several thousand tokens. It is identical for every extraction call against that collection. Only the user content (the email, the transcript, the document body) changes.
Without caching, every call pays the full input rate for the entire prompt. With caching, the static system prefix is written once to the provider's cache, then served at a heavily discounted rate (around ten percent of the standard input rate, on Bedrock) for the next five minutes. Each cache hit restarts that five-minute window, so a busy worker keeps the prefix warm for as long as it keeps calling. A high-throughput extraction worker burning through a thousand records in fifteen minutes pays the full rate exactly once, and the discounted rate 999 times.
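To see why call spacing matters, here is a small simulation of a prefix cache with a five-minute lifetime. It assumes that every hit refreshes the lifetime, which is how the thousand-records example stays at a single full-rate call; `count_cache_writes` and its parameters are illustrative names, not part of any provider API.

```python
CACHE_TTL_SECONDS = 300  # the five-minute window from the text


def count_cache_writes(arrival_times: list[float], ttl: float = CACHE_TTL_SECONDS) -> int:
    """Count calls that miss the cache, assuming every call restarts the TTL."""
    writes = 0
    expires_at = float("-inf")
    for t in sorted(arrival_times):
        if t >= expires_at:
            writes += 1  # cold cache: pay the full input rate and write the prefix
        expires_at = t + ttl  # a write or a hit both restart the window
    return writes


# A thousand records spread evenly over fifteen minutes (one every 0.9 s).
busy_worker = [i * 0.9 for i in range(1000)]
# A sparse worker that calls once every six minutes.
sparse_worker = [0.0, 360.0, 720.0]

print(count_cache_writes(busy_worker))    # 1
print(count_cache_writes(sparse_worker))  # 3
```

The sparse worker never benefits: each call arrives after the previous window has expired, so every call pays the full rate.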
The total cost reduction depends on the ratio of cached prompt tokens to content tokens. A schema with thirty typed properties, descriptions, instructions, and examples runs around 3,000 tokens of system prompt. A typical extraction content payload is 500-2000 tokens. The cache-eligible portion is between 60% and 85% of input tokens. A ninety percent discount on that portion cuts the input-token bill by roughly 54 to 77 percent. Output tokens are not cached and are billed as usual, so the saving on the whole call is smaller: fifteen to twenty percent off the total extraction cost.
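The arithmetic can be worked through in a few lines. The token counts and the ninety percent discount come from the text; the share of a call's cost that comes from input tokens is not stated, so `input_cost_share` below is a hypothetical placeholder chosen only to show how a saving on input tokens shrinks once uncached output tokens are counted.

```python
def cached_share(prefix_tokens: int, content_tokens: int) -> float:
    """Fraction of input tokens that belong to the cacheable static prefix."""
    return prefix_tokens / (prefix_tokens + content_tokens)


def input_saving(share: float, cache_discount: float = 0.9) -> float:
    """Fraction of the input-token bill removed by caching."""
    return share * cache_discount


def whole_call_saving(saving_on_input: float, input_cost_share: float) -> float:
    """Saving on the full call, given the (assumed) share of cost that is input tokens."""
    return saving_on_input * input_cost_share


for content_tokens in (2000, 500):
    share = cached_share(3000, content_tokens)
    on_input = input_saving(share)
    # input_cost_share=0.27 is an assumption for illustration, not a measured figure.
    overall = whole_call_saving(on_input, input_cost_share=0.27)
    print(f"content={content_tokens}: cached {share:.0%}, "
          f"input saving {on_input:.0%}, whole-call saving {overall:.0%}")
```

To apply this to a real workload, replace the placeholder with the measured split between input and output spend for the extraction model in use.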
Add it to the batch fifty percent and the two discounts compound: the cache discount applies inside the already-discounted batch rate, so the bulk path costs between 0.50 × 0.80 and 0.50 × 0.85 of the sync price. The math works out to roughly 58-60% off the all-in sync-path cost.
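Because the cache saving applies to the already-halved batch price, the discounts multiply rather than add. A short check of that arithmetic, using the fifty percent batch discount and the fifteen to twenty percent cache saving quoted above (`stacked_price` is an illustrative helper name):

```python
def stacked_price(batch_discount: float, cache_saving: float) -> float:
    """Fraction of the sync price paid when both discounts compound."""
    return (1 - batch_discount) * (1 - cache_saving)


for cache_saving in (0.15, 0.20):
    price = stacked_price(0.50, cache_saving)
    print(f"cache saving {cache_saving:.0%}: pay {price:.1%} of sync, "
          f"{1 - price:.1%} off")
```

With these inputs the bulk path lands at 40 to 42.5 percent of the sync price, which matches the "roughly forty percent of sync rate" figure used in the mode comparison later in the piece.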
How the system threads it through
The customer-facing surface is a single field. executionMode: 'bulk' on a memorize call routes through the batch path. Everything else is plumbing.
// Inline write (sync path, full rate, seconds)
{ "content": "...", "email": "jane@acme.com", "executionMode": "sync" }
// Background write (async path, full rate, minutes)
{ "content": "...", "email": "jane@acme.com", "executionMode": "async" }
// Batch write (bulk path, half-price, 1-24h)
{ "content": "...", "email": "jane@acme.com", "executionMode": "bulk" }
What happens behind that flag, abbreviated:
- The API surface accepts the request and returns an eventId immediately (202 Accepted). The caller is not blocked.
- A Step Functions state machine queues the work.
- A Fargate dispatcher batches incoming work by provider, builds the provider-specific request format, and submits to the provider's batch API. The job ID is stored on the eventId for tracking.
- For Bedrock, an EventBridge rule catches the job-complete event. For Anthropic and OpenAI, a poller Lambda runs every five minutes and queries pending jobs.
- When a job completes, a finisher worker reads the results, applies the same Stage 2 logic the sync path uses (storage write, dual-write to LanceDB and DynamoDB, graph fan-out, dedup), and emits a webhook to the customer's configured endpoint.
- The customer's webhook handler treats the result the same way it would treat an async result, because that is what it is, just one to twenty-four hours later.
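The dispatcher step in the list above (group queued work by provider, then cut it into provider-sized batches) might look like the sketch below. `PendingWrite`, `plan_batches`, and the per-provider caps are hypothetical; the tiny caps in the usage example are for illustration only, and real caps would come from each provider's published limits.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingWrite:
    event_id: str   # returned to the caller with the 202 response
    provider: str   # which provider's batch API will serve this write
    content: str    # the payload to extract from


def plan_batches(
    writes: list[PendingWrite],
    max_per_batch: dict[str, int],
) -> list[tuple[str, list[PendingWrite]]]:
    """Group writes by provider, then split each group into capped batches."""
    by_provider: dict[str, list[PendingWrite]] = defaultdict(list)
    for write in writes:
        by_provider[write.provider].append(write)

    plans: list[tuple[str, list[PendingWrite]]] = []
    for provider, items in by_provider.items():
        cap = max_per_batch[provider]
        for start in range(0, len(items), cap):
            plans.append((provider, items[start:start + cap]))
    return plans


queued = [PendingWrite(f"evt-{i}", "anthropic", "...") for i in range(5)]
queued += [PendingWrite(f"evt-{i}", "openai", "...") for i in range(5, 7)]

for provider, batch in plan_batches(queued, {"anthropic": 2, "openai": 10}):
    print(provider, [w.event_id for w in batch])
```

Keeping the eventId on every item is what lets the finisher reconcile provider results back to the original request once the job completes.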
The Stage 2 logic is shared across sync, async, and bulk paths. Same dedup, same redaction tiers, same identity resolution, same graph channels. The only thing different between paths is the LLM call layer and the wait.
The same discount, for documents
The batch architecture is not memorize-specific. It is provider-specific. Any write that calls an LLM gets the discount when it routes through the bulk path.
That includes POST /api/v1.1/context/saveBatch, the v1.1-only async document writer. Documents with aiExtraction: true invoke the same extraction LLM that memorize uses. When the customer submits a batch of documents on the bulk path, the same provider batch API serves them, the same prompt caching applies (the system prompt for doc extraction overlaps heavily with memorize), and the same 1-24 hour SLA holds.
A customer importing five hundred playbook documents at one a.m. for an overnight onboarding run pays roughly 40 percent of what the inline path would charge. The trade is the latency budget: they cannot use those docs in the morning standup if the batch is still in flight. For an overnight backfill, that is not a constraint.
The trade-off, made explicit
The three execution modes form a clean cost-versus-latency curve.
Sync, on the Lambda path, costs full rate and returns in sub-second to seconds. Use it when an agent is waiting on the result inside an instruction chain, when a user-facing UI needs the write to be visible immediately, when a webhook receiver is going to poll within thirty seconds.
Async, on the Fargate Step Functions path, costs full rate and returns in minutes. Use it when the result will be consumed by a webhook handler or a follow-up workflow, when the caller cannot afford to block the request thread but cannot wait hours either.
Bulk, on the provider batch path, costs roughly forty percent of the sync rate (the fifty percent batch discount compounded with the cache discount) and returns in 1-24 hours. Use it wherever the work is queued: CRM imports, scheduled backfills, daily ingest, monthly reprocessing, retraining set generation, bulk re-extraction after a schema change.
Picking the execution mode is an engineering decision about how the customer's downstream code consumes the result, not a quality decision. The extraction itself is identical across all three paths.
The temptation, when you first see the discount, is to push everything to bulk. The reason not to is that bulk's variance ruins SLAs. A customer who runs a daily ingest at 6 a.m. expecting the writes to be visible by 9 a.m. lives in the world where bulk works. A customer whose product depends on a memory being readable within ninety seconds of being submitted does not.
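One way to encode that decision is a small routing helper. This is a sketch under stated assumptions: `choose_execution_mode` and its parameters are hypothetical, and the three-hour default threshold only mirrors the 6 a.m. to 9 a.m. example; it is not a delivery guarantee, and a stricter caller would set it to the full 24-hour ceiling.

```python
from typing import Literal

ExecutionMode = Literal["sync", "async", "bulk"]

# Hypothetical default: how long a write may stay invisible before bulk is unsafe.
BULK_MIN_TOLERANCE_SECONDS = 3 * 60 * 60


def choose_execution_mode(
    caller_is_blocked: bool,
    tolerance_seconds: float,
    bulk_min_tolerance: float = BULK_MIN_TOLERANCE_SECONDS,
) -> ExecutionMode:
    """Pick a mode from how the downstream code consumes the result."""
    if caller_is_blocked:
        return "sync"   # an agent or UI is waiting on the write
    if tolerance_seconds >= bulk_min_tolerance:
        return "bulk"   # queued work that can absorb batch variance
    return "async"      # nobody is blocked, but hours are too long


print(choose_execution_mode(True, 5))             # sync
print(choose_execution_mode(False, 90))           # async
print(choose_execution_mode(False, 3 * 60 * 60))  # bulk
```

The helper deliberately takes no quality parameter, since the extraction is the same on every path.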
Why writes are the right place to spend latency
The conventional wisdom on AI cost is to optimize the LLM call: pick a cheaper model, shorten the prompt, cache aggressively. Memorization is where that wisdom translates most directly into dollars, because writes are the cost-dominant operation in a production memory system.
A single memorize call does extraction. Extraction is a multi-thousand-token LLM call that produces typed properties, atomic memories, and optionally graph edges. A single retrieve call against the same record reads from PostgreSQL and a vector store. Even on the brief mode where retrieval invokes its own (cheap) synthesis LLM, the cost is a fraction of what extraction costs.
For most production systems we have measured, the ratio is something like 50:1: fifty times more LLM cost is spent extracting a memory than is spent serving every retrieval against it across its lifetime. Pushing that 50 down by roughly 60% is not a margin improvement. It is the headline number on the bill.
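A quick way to see why the write discount dominates the bill: when writes are a large multiple of read cost, almost all of a write-side discount carries through to total LLM spend. The helper below is illustrative (`total_llm_saving` is a hypothetical name); the example call uses the 50:1 ratio from the text and a 60 percent write discount, consistent with the roughly-forty-percent-of-sync figure in the mode comparison.

```python
def total_llm_saving(write_to_read_ratio: float, write_discount: float) -> float:
    """Fraction of total LLM spend removed when only the write side is discounted."""
    write_share = write_to_read_ratio / (write_to_read_ratio + 1)
    return write_share * write_discount


saving = total_llm_saving(write_to_read_ratio=50, write_discount=0.60)
print(f"{saving:.1%} off total LLM spend")
```

At a 1:1 ratio the same discount would remove only 30 percent of total spend, which is why this lever matters most for write-heavy memory systems.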
The latency cost of moving extraction to the batch path is paid by the system, not the user. The user's downstream workflow already runs on the time scale of "minutes to hours after a CRM event" for most use cases. Pushing the extraction inside that envelope, instead of inside a five-second sync window, costs the user nothing they were going to use.
The retrieval side stays fast. The unified retrieve trilogy (retrieval as conversation, four sources in one call, intent-based delegation) covers the read-path optimizations that compound across thousands of agent steps. The write path's job is to be correct, to be governed, and to be cheap enough that customers can write as much as they need to. Cheap enough is what the batch + cache stack delivers.
The principle
Latency is the resource you trade for cost in LLM systems. Sync paths spend money to buy seconds. Bulk paths spend hours to buy back the money. The decision is not which is right; both are right. The decision is per-write, per workflow, per customer.
What makes the stack work is that the user-facing surface is one flag. The customer does not write code against three different APIs. They submit the same memorize or context-save call and add executionMode: 'bulk' when they can wait. The architecture beneath (Fargate workers, Step Functions, provider batch endpoints, poller Lambdas, EventBridge) is the engineering team's problem, not the customer's.
When the providers published their batch pricing, the discount was theoretical. It existed only for callers willing to wire up to a different API, handle a different response shape, manage their own polling, and reconcile results to their existing storage. Most customers were not going to do that. Most customers also did not need to pay full price for writes that did not have to be inline. The job of the platform is to make the discount reachable through the same call the customer was already making.
If you are building on top of memorize-style writes and your bill is dominated by extraction, audit one thing first. How much of your write traffic is genuinely inline-required, and how much is queued work that already runs on the hour or the day? The queued portion is sitting on the table at roughly 40 percent of what you are paying. Move it.
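A first pass at that audit can be as simple as measuring how long each write sat before anything read it. The sketch below assumes you can export, per write, the gap in seconds between submission and first read; `deferrable_share` and the one-hour threshold are illustrative choices, not a prescribed method.

```python
def deferrable_share(first_read_gaps_seconds: list[float], min_gap_seconds: float) -> float:
    """Fraction of writes whose first read came at least `min_gap_seconds` after submission."""
    if not first_read_gaps_seconds:
        return 0.0
    deferrable = sum(1 for gap in first_read_gaps_seconds if gap >= min_gap_seconds)
    return deferrable / len(first_read_gaps_seconds)


# Four writes: two read within a minute, two first read hours later.
gaps = [5.0, 60.0, 7200.0, 90000.0]
print(deferrable_share(gaps, min_gap_seconds=3600))  # 0.5
```

Writes that are never read within the chosen window are candidates for the bulk path; writes read within seconds are the ones that genuinely need to stay inline.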
Companion pieces: Retrieval Is a Conversation, One Call, Four Sources, and Think and Execute For Me cover the read-path optimizations that pair with this write-path discount.