Instructions, Not Prompts: How We Run 1,000 Agents Per Minute Without the Bill Going Sideways

Per-token pricing punishes you for using AI at scale. Per-minute pricing forces you to design agents like production systems. Here is the pattern we converged on, the trade-offs of one big chain versus a forest of small agents, and what changes when models are designed for workflows from the start.

The unit of work changed under us

Three years ago, the unit of AI work was a prompt. One question, one answer, one bill measured in tokens. Most of us were still pasting paragraphs into a chat box.

The unit today is something else. A real production agent in our system is not a prompt. It is a chain of instructions, where each instruction is a separate LLM call with its own context window, its own tool budget, its own abort path. The model is not "an assistant". It is one step in a deterministic outer loop that you control.

When we sold tokens, customers were paying for words. When we ran the math, what they were actually buying was minutes of supervised reasoning across thousands of records. We changed the unit on the bill to match the unit they were consuming.

This article is about what that shift made obvious. If you are designing for one prompt at a time, you are designing for the chat era. The production unit now is the instruction chain, and the resource it consumes is wall-clock minutes of orchestrated reasoning.

What an instruction chain actually looks like

Take a sales agent that reads inbound replies. The chat-era version is a single prompt: "Read this email and decide what to do." You feed it the email, hope for a structured response, move on.

The production version is three instructions. Each is its own LLM round-trip.

{
  "instructions": [
    "Look up everything we know about this contact. Search for recent role changes and funding news. Summarize what you found in plain prose. Do not emit output markers yet.",
 
    "Based on the summary above, identify the two strongest signals to act on. If you cannot identify even one verified signal, emit <abort reason='insufficient_signal'>...</abort> instead. Do not invent signals.",
 
    "Compose a short outreach reply (≤ 80 words) personalized to those signals. Emit the required outputs. Use only facts established above."
  ],
  "outputs": [
    { "name": "reply_draft",  "required": true },
    { "name": "top_signal",   "required": true },
    { "name": "uncertainty",  "required": false }
  ],
  "memorize": { "email": "alice@acme.com", "type": "Contact" },
  "agentTools": true,
  "tier": "pro"
}

Figure: Anatomy of an instruction chain. Parallel pre-flight auto-retrieval (SmartContext + SmartRecall), three sequential LLM calls each running their own tool-call steps, an abort gate in the middle instruction, and outputs auto-memorized to the record on success.

What is hidden here matters more than what is visible.

Before the first instruction even runs, the framework pulls two things in parallel: SmartContext (the org's guidelines, playbooks, tone rules) and SmartRecall (everything we have learned about Alice from prior calls, emails, and meetings). Both get injected into the system prompt. The model sees the org's knowledge and Alice's history before it sees your instruction.
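The parallel pre-flight step can be sketched as follows. This is a minimal illustration, not the framework's actual code: fetch_smart_context, fetch_smart_recall, and build_system_prompt are hypothetical names, and the two fetchers return placeholder strings where a real system would query its guideline store and record memory.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_smart_context(org_id: str) -> str:
    # Hypothetical stand-in: a real system would return the org's
    # guidelines, playbooks, and tone rules.
    return f"[guidelines for {org_id}]"


def fetch_smart_recall(email: str) -> str:
    # Hypothetical stand-in: a real system would return what is already
    # known about the contact from prior calls, emails, and meetings.
    return f"[history for {email}]"


def build_system_prompt(org_id: str, email: str) -> str:
    # The two retrievals do not depend on each other, so run them concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        context_future = pool.submit(fetch_smart_context, org_id)
        recall_future = pool.submit(fetch_smart_recall, email)
        context = context_future.result()
        recall = recall_future.result()
    # Retrieved material is placed ahead of any instruction text.
    return f"ORG CONTEXT:\n{context}\n\nRECORD HISTORY:\n{recall}"
```

Because both lookups are I/O-bound in practice, the pre-flight latency is roughly the slower of the two rather than their sum.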

Each instruction then runs as its own LLM call. The model can take ten steps inside one instruction: search, think, call a tool, think again, call another tool, draft, revise, finalize. The chain has three instructions. The total step count is whatever the model needed inside each one, summed.

If instruction 2 emits the abort marker, instruction 3 does not run. We do not write a half-baked reply to the record, and we do not spend the seconds instruction 3 would have taken on work we will not act on. The abort is a feature, not an error.
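A minimal sketch of the outer loop with an abort gate, assuming the abort marker has the same shape as the one in the example chain. The names run_chain, ChainResult, and call_model are hypothetical; call_model stands in for whatever function performs one LLM round-trip, tool steps included.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Optional

# Matches the opening abort tag, e.g. <abort reason='insufficient_signal'>
ABORT_RE = re.compile(r"<abort reason='([^']*)'>")


@dataclass
class ChainResult:
    aborted: bool
    abort_reason: Optional[str]
    instructions_run: int
    history: list = field(default_factory=list)


def run_chain(
    instructions: list[str],
    call_model: Callable[[str, list], str],
) -> ChainResult:
    history: list[str] = []
    for index, instruction in enumerate(instructions, start=1):
        # One LLM round-trip per instruction; earlier replies carry forward.
        reply = call_model(instruction, history)
        history.append(reply)
        match = ABORT_RE.search(reply)
        if match:
            # Stop here: later instructions never run, nothing is written.
            return ChainResult(True, match.group(1), index, history)
    return ChainResult(False, None, len(instructions), history)
```

The caller decides what to do with an aborted result. In this sketch the only guarantee is that no instruction after the abort is executed.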

That last point is the one that takes the longest to learn. The most important thing a production agent can do is refuse to act on bad data.

Why per-minute pricing forced the discipline

When we billed per token, here is what customers actually experienced.

A normal prompt cost a fraction of a cent. A tool-calling loop with three retries cost ten times that. A research agent that decided it needed "more context" and looped through fifteen tool calls cost a hundred times the original. The variance was the cost.

Per-token pricing is mathematically honest and operationally hostile. It punishes you for using the system at scale, because at scale the variance is what shows up on the bill.

We moved to per-minute pricing because that is the actual resource being consumed. An instruction chain that takes 12 seconds to run uses 12 seconds of orchestrated reasoning. The bill is observable. The scaling is linear. The forecast for running 50,000 records is (seconds per run) × 50,000 / 60 = billed minutes, and billed minutes × the per-minute rate = budget.
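The forecast arithmetic is simple enough to keep in a helper. This is a sketch: forecast_minutes and forecast_cost are hypothetical names, and price_per_minute is a parameter you supply, since no rate is stated here.

```python
def forecast_minutes(seconds_per_run: float, records: int) -> float:
    # (seconds per run) x (number of records) / 60 = billed minutes
    return seconds_per_run * records / 60


def forecast_cost(seconds_per_run: float, records: int, price_per_minute: float) -> float:
    # price_per_minute is an assumption supplied by the caller.
    return forecast_minutes(seconds_per_run, records) * price_per_minute
```

For the 12-second chain above, forecast_minutes(12, 50_000) comes to 10,000 minutes across the full run.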

This changed how we built things. A few examples.

Instruction count became a budget conversation. The authoring guide says 2 to 4 instructions is usually right and 6+ is a smell. That was advice. Per-minute pricing made it operational. Every instruction has its own LLM round-trip latency. If your chain has 7 instructions averaging 3 seconds each, you are 21 seconds in before the model has spoken its final word. 21 seconds × 50,000 records = 1,050,000 seconds, or about 292 hours of compute time you will pay for whether the run was useful or not.

Tool budgets became visible. If a single research call adds 4 seconds, the question becomes "is that 4 seconds worth it across 50,000 records?" Not in tokens. In minutes. In dollars.

Tuning became a real engineering exercise. "Can I get instruction 2 from 6 seconds down to 3?" is a question with a stopwatch in it. You can pull the same insight from token counts, but it lands differently. People optimize what they can see on a wall clock.
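Putting a stopwatch on each instruction takes very little code. A minimal sketch, assuming run_instruction is whatever callable executes one instruction end to end; time_instructions and slowest are hypothetical helper names, and the clock is injectable so the timing logic can be tested without real delays.

```python
import time
from typing import Callable


def time_instructions(
    instructions: list[str],
    run_instruction: Callable[[str], object],
    clock: Callable[[], float] = time.perf_counter,
) -> list[tuple[int, float]]:
    timings = []
    for index, instruction in enumerate(instructions, start=1):
        started = clock()
        run_instruction(instruction)
        # Elapsed wall-clock seconds for this instruction only.
        timings.append((index, clock() - started))
    return timings


def slowest(timings: list[tuple[int, float]]) -> tuple[int, float]:
    # The instruction to look at first when the chain is too slow.
    return max(timings, key=lambda item: item[1])
```

Averaging these timings over a tuning batch, rather than reading a single run, gives a steadier answer to which instruction deserves the work.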

The shift from tokens to minutes is not a pricing trick. It is a forcing function. You cannot design an agent for production until the resource it consumes is the resource you are billed for.

The design pattern: tune on 50, run on 50,000

The thing nobody tells you about production agents is that they are designed iteratively, against real records, on a small batch. The authoring workflow looks like this.

1. Write the first draft of the instruction chain for one record. Run it. Read the output.
2. Add the chain to a script. Run it against 50 records of the type you care about: 50 inbound leads, 50 support tickets, 50 reply emails.
3. Read every output. Mark the failures. Look at where the model invented a signal, mislabeled a sentiment, or wrote a reply that referenced a fact it had no source for.
4. Adjust. Maybe instruction 2 needed an explicit abort condition. Maybe instruction 1 was asking the model to "first recall what we know" when SmartRecall already did that for free. Maybe agentTools was off and the model was substituting plausible-sounding facts in place of a research call.
5. Re-run on the same 50. Compare. Iterate until the failure mode is one you can live with.
6. Put it on autopilot for the other 49,950.
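The run, mark, re-run, compare loop above can be sketched as a small harness. Everything here is hypothetical scaffolding: run_chain stands in for executing the chain on one record, and label_failure stands in for the reviewer's judgment (in practice a person reading every output), returning None for an acceptable output or a short failure label.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Review:
    record_id: str
    failure: Optional[str]  # None if acceptable, else e.g. "invented_signal"


def run_batch(
    records: dict[str, str],
    run_chain: Callable[[str], str],
    label_failure: Callable[[str, str], Optional[str]],
) -> list[Review]:
    reviews = []
    for record_id, record in records.items():
        output = run_chain(record)
        reviews.append(Review(record_id, label_failure(record, output)))
    return reviews


def failure_rate(reviews: list[Review]) -> float:
    if not reviews:
        return 0.0
    return sum(1 for r in reviews if r.failure is not None) / len(reviews)


def compare_runs(before: list[Review], after: list[Review]) -> dict[str, list[str]]:
    # Same records, two chain versions: what got fixed, what regressed.
    failed_before = {r.record_id for r in before if r.failure is not None}
    failed_after = {r.record_id for r in after if r.failure is not None}
    return {
        "fixed": sorted(failed_before - failed_after),
        "regressed": sorted(failed_after - failed_before),
    }
```

Re-running on the same 50 records is what makes compare_runs meaningful: a change that fixes three records and regresses five shows up immediately.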

The pattern is small-batch design followed by scaled execution. It is closer to manufacturing process design than to writing a prompt. You are not asking "what is the perfect wording?" You are asking "what failure rate is acceptable on the run that processes everything?"

The pricing shape supports this. Tuning on 50 costs you a few dollars. Running on 50,000 with a tuned chain is predictable per-minute spend you can forecast a month ahead.

The mistake we still see most often is people skipping the tuning batch and going straight to production. The instruction chain works on three records the engineer hand-picked. It falls over on the fifty real records that came in this morning. The fix is not a better model. The fix is the loop.

One big chain versus a forest of small agents

Once you have the pattern, you have a design question. Do you build one large agent with a long instruction chain, or do you build a forest of small agents that coordinate?

Both work. The trade-off is real.

One big chain (10 to 100 instructions). Everything happens in one execution. The chatHistory carries forward. You can reason about the full sequence top to bottom. Failure modes are concentrated: when something goes wrong, you know which instruction. Cost is one billed run per record.

This is the right pattern when the steps genuinely need each other's outputs and the whole thing wants to be one transaction. A research pipeline that gathers, scores, reasons, and writes is a good one-chain candidate.

A forest of small agents (2 or 3 instructions each, sharing memory). You break the problem into N agents, each with a narrow job. They communicate through a shared memory layer that any of them can write to and any of them can read from. One agent enriches the record. A second classifies intent. A third drafts the reply. A fourth reviews it against compliance rules.

This is the right pattern when the jobs are independent enough to parallelize, when different jobs run on different schedules (enrichment on Sunday, classification when a reply arrives, drafting when a sequence triggers), or when you want different agents owned by different teams.

The thing that makes the forest pattern actually work is the shared memory layer. Without it, you have N agents talking past each other. With it, every agent reads from and writes to the same governed memory about the record. Agent 3 sees what agent 1 wrote. Agent 4 sees what agent 2 wrote. The coordination is the data, not the message bus.
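A toy version of coordination through shared memory, to make the idea concrete. This is an in-process sketch under stated assumptions, not a description of any real memory layer: SharedMemory, enrich, classify_intent, and draft_reply are hypothetical names, the values written are placeholders, and the write log only hints at what governance would involve.

```python
class SharedMemory:
    def __init__(self) -> None:
        self._records: dict[str, dict[str, str]] = {}
        self.log: list[tuple[str, str, str]] = []  # (record_id, agent, key)

    def write(self, record_id: str, agent: str, key: str, value: str) -> None:
        self._records.setdefault(record_id, {})[key] = value
        self.log.append((record_id, agent, key))

    def read(self, record_id: str, key: str, default=None):
        return self._records.get(record_id, {}).get(key, default)


def enrich(memory: SharedMemory, record_id: str) -> None:
    # Placeholder value; a real agent would research the contact.
    memory.write(record_id, "enricher", "role", "VP Engineering")


def classify_intent(memory: SharedMemory, record_id: str, reply_text: str) -> None:
    # Placeholder rule standing in for an LLM classification call.
    intent = "interested" if "pricing" in reply_text.lower() else "unknown"
    memory.write(record_id, "classifier", "intent", intent)


def draft_reply(memory: SharedMemory, record_id: str):
    role = memory.read(record_id, "role")
    intent = memory.read(record_id, "intent")
    if role is None or intent is None:
        return None  # Upstream agents have not written yet; do nothing.
    draft = f"Reply for a {role} with intent '{intent}'"
    memory.write(record_id, "drafter", "reply_draft", draft)
    return draft
```

None of the agents calls another. Each can run on its own trigger and schedule, and the drafter simply declines to act until the facts it needs are in memory.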

We use both patterns in the same product. The inbound reply handler is a single chain because the steps depend on each other. The account intelligence pipeline is a forest because the jobs come in on different triggers and want to be owned by different teams.

The rough decision rule we landed on:

If the steps are sequential and one transaction makes sense, use one chain.
If the steps are independent triggers and want to be owned separately, use a forest with shared memory.
If you are not sure, start with one chain and split when the chain is doing two jobs.

What changes when the model knows it is in a workflow

Last week Anthropic announced Opus 4.8 with explicit workflow primitives. We will not see the full picture for another couple of months, but the direction is clear and it is the same direction the production data has been pointing in for a year.

Models are getting better at being one step in a chain, not the whole show.

The pattern that emerged from running production agents is that the model does its best work when each call has one job, the context is bounded, and the failure mode is something the surrounding code can catch. The model is great at classification, generation, judgment. It is bad at remembering to do six things in the right order on every retry. Code is great at order. The two split the labor.

A workflow-aware model amplifies this. Native support for sub-steps inside an instruction (the AI SDK calls these "tool-call rounds") means less framework code in the middle. Better caching of the static system prefix across instructions in a chain means the second and third instructions can get the roughly 90% prompt-cache discount on the cached prefix tokens. Improved instruction-following at the boundary means the model is more likely to honor your abort markers and structured output schemas without restating the rules in every instruction.

The implication: the gap between "I called a model" and "I ran an agent" is closing from the model side. The production discipline still has to come from you.

What this pattern actually buys you

After running this pattern for a year, here is what is true in our experience.

Predictability. Running 50,000 records on a tuned chain costs within 10% of the forecast every time. The variance is in records that triggered abort, not in tokens.

Scalability. The bottleneck is no longer "what does the model do under load". It is queue depth and provider rate limits, both of which are engineering problems with known shapes.

Auditability. Every run has an event ID. Every event has the instructions executed, the steps taken, the tools called, the outputs emitted, the abort reason (if any), and the memory writes that resulted. When someone asks "why did the system do that?", we point at the event and walk through it.
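One possible shape for such an event record, built from the fields listed above. This is a sketch only: RunEvent, to_json, and explain are hypothetical names, and the real schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RunEvent:
    event_id: str
    instructions: list[str]
    steps: list[str] = field(default_factory=list)
    tools_called: list[str] = field(default_factory=list)
    outputs: dict[str, str] = field(default_factory=dict)
    abort_reason: Optional[str] = None
    memory_writes: list[dict] = field(default_factory=list)


def to_json(event: RunEvent) -> str:
    # A serializable form suitable for an append-only audit log.
    return json.dumps(asdict(event), sort_keys=True)


def explain(event: RunEvent) -> str:
    # A one-line answer to "why did the system do that?"
    if event.abort_reason is not None:
        return f"{event.event_id}: aborted ({event.abort_reason}), nothing written"
    return (
        f"{event.event_id}: ran {len(event.instructions)} instructions, "
        f"called {len(event.tools_called)} tools, "
        f"wrote {len(event.memory_writes)} memory entries"
    )
```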

Optimizability. When a chain is slow or expensive, we have a stopwatch on each instruction. We know which one to cut. We know what we are paying for.

None of this is unique to us. Stripe's Minions architecture, Shopify's Sidekick, and Salesforce's hybrid-reasoning approach are reaching similar conclusions from different starting points. The convergence on instruction-chain patterns with shared memory is what production AI looks like now.

The honest part

This pattern is more disciplined than chat-era prompting. You spend more time designing the chain. You write more abort conditions than feels natural. You tune against a real batch before you turn on autopilot. The first agent you build this way will feel slow to ship.

The payoff arrives at the scale point. The first agent is slower to write. The fiftieth is faster, cheaper, and more predictable than any single-prompt version could ever be.

If you are still designing one prompt at a time and wondering why the production bill is double the forecast, the answer is probably not the model. The answer is that you are using a chat-era tool for production-era work.

The fix is not to upgrade to a better model. The fix is to change the unit.

References: Anthropic Opus 4.8 workflow announcement (May 2026), Stripe Minions Architecture, Shopify Engineering: Production Agentic Systems, Personize multi-step instructions authoring guide (internal).