We started with the same messages[] pattern everyone uses for LLM APIs. For complex, repeatable agent workflows, it kept failing in predictable ways. So we decomposed instructions into sequential steps[] with scoped tools and shared context. Here's what we learned.
TL;DR
- We built our agent API around messages[], the same pattern everyone uses. It works for chatbots. For multi-step workflows that need to run reliably, it kept breaking in ways we could predict but not prevent.
- Our fix: replace a single complex prompt with an ordered sequence of steps[], each with its own prompt, tool scope, and success criteria. The LLM executes steps sequentially, accumulating context as it goes.
- We got three benefits: complexity became manageable (each step is a bounded task), the LLM was forced to slow down and work methodically, and we got fine-grained control over which tools are available at which stage.
- The result is multi-agent-like behavior from a single context window. Each step acts as a focused specialist, but all steps share the same accumulated state. No serialization, no handoff protocols, no conflicting memories.
- We shipped this as POST /api/v1/responses with a steps[] parameter. This article explains what we learned.
What we kept hitting
Here's a task we needed to automate: research a company, find the right contact, draft a personalized outreach email, and self-evaluate the draft against brand guidelines.
Our first approach, like most teams, was to pack everything into one prompt:
You are a sales research agent. Research {company}, find the VP of
Engineering, look up their recent activity, draft a personalized
cold email following our brand voice guidelines, and then evaluate
whether the email meets our compliance rules. Output the email
subject, body, and a compliance score.
This works sometimes. When it fails, it fails in ways we started to recognize.
The LLM rushes. We watched models satisfy each sub-task minimally rather than thoroughly. The research step gets a paragraph. The contact lookup gets skipped if the model can infer a plausible name. The self-evaluation is a rubber stamp. The model optimizes for completion, not quality at each stage.
Tool usage is unpredictable. When the model has access to ten tools and a five-part instruction, it decides which tools to use, in what order, and how many times. Sometimes it calls web_search three times to build a thorough picture. Sometimes it calls it once and moves on. We had no control over the depth of each phase.
Debugging is opaque. When the output was wrong, we couldn't tell where the process broke down. Was the research bad? Was the contact lookup wrong? Did the model ignore the brand guidelines? The entire workflow is a single LLM call with a single output. There's no intermediate state to inspect.
We realized the root cause was architectural: a single instruction gives the LLM a single opportunity to plan, and it will always choose the shortest path through the plan.
What we did instead
Our approach was to decompose the instruction into discrete steps and execute them sequentially, each step building on the accumulated context of prior steps.
const result = await personize.responses.create({
steps: [
{
prompt: 'Research {{company}}. Find their tech stack, recent funding, and key technical challenges.',
tools: ['web_search', 'recall_pro']
},
{
prompt: 'Find the VP of Engineering or equivalent technical leader. Look up their recent LinkedIn activity and publications.',
tools: ['web_search']
},
{
prompt: 'Draft a personalized cold email referencing specific details from your research. Follow the brand voice guidelines.',
},
{
prompt: 'Evaluate the draft against compliance rules. Score 1-10. If below 7, revise and re-score.',
}
],
personize: {
governance: { guideline_ids: ['brand-voice', 'outreach-compliance'] },
memory: { record_id: 'company-123', recall: true }
}
});
Each step is a bounded task. The LLM receives one instruction, executes it, produces output, and that output becomes part of the context for the next step. We control the sequence, the tool availability at each stage, and the granularity of each task.
This is not prompt chaining. Prompt chaining typically involves separate LLM calls with explicit input/output parsing between them. Instruction decomposition maintains a single, shared context window. Step 3 doesn't receive a serialized summary of steps 1 and 2. It sees the full conversation history: the original prompts, the tool calls, the raw results, the generated text. Nothing is lost in translation.
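The distinction can be made concrete in a few lines. Below is a minimal sketch (not the actual implementation — `call` is a stand-in for an LLM request): chaining carries only the previous step's text forward, while step execution appends every prompt and result to one shared history.

```typescript
// Sketch: prompt chaining passes only the prior output forward;
// step execution keeps the full history. `call` is a stand-in LLM.
type Message = { role: 'user' | 'assistant'; content: string };

const call = (messages: Message[]): string =>
  `answer(${messages.map((m) => m.content).join(' | ')})`;

// Prompt chaining: each call sees ONLY the previous step's text.
function chain(prompts: string[]): string {
  let carried = '';
  for (const prompt of prompts) {
    carried = call([{ role: 'user', content: `${prompt}\n${carried}` }]);
  }
  return carried; // intermediate tool calls and raw results are gone
}

// Step execution: every call sees every prior prompt and result.
function steps(prompts: string[]): Message[] {
  const history: Message[] = [];
  for (const prompt of prompts) {
    history.push({ role: 'user', content: prompt });
    history.push({ role: 'assistant', content: call(history) });
  }
  return history; // the full, uncompressed history survives
}
```

In the chained version, anything not explicitly carried in `carried` is lost to later steps; in the step version, step 3 can quote step 1 verbatim.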
In practice
Here are two tasks from our own pipeline. The first is what convinced us.
Personalized sales email
The way most teams (including us, initially) approach this task is one large prompt:
POST /api/v1/responses
{
"model": "claude-sonnet-4-6",
"system": "You are a sales assistant.",
"steps": [
{
"prompt": "Look up everything we know about sarah@acme.com (past interactions, job title, interests, pain points), check our brand voice guidelines and tone rules, think about what angle would resonate best with her, then write a personalized cold outreach email referencing her specific situation, using our brand voice, with a compelling subject line. Under 150 words."
}
]
}
The problems with this:
- The LLM has all tools available and decides the order; recall may never happen
- No way to verify the recall step ran before the email was written
- If the output is wrong, there is no intermediate state to inspect
- Reasoning, recall results, and draft share the same token budget
The step-based version:
POST /api/v1/responses
{
"model": "claude-sonnet-4-6",
"system": "You are a sales assistant.",
"steps": [
{
"prompt": "Recall everything we know about sarah@acme.com: job title, past interactions, pain points, interests, recent activity.",
"tools": ["recall_pro"],
"max_steps": 3
},
{
"prompt": "Using the contact profile above, fetch our brand voice guidelines and identify the top 2 angles most likely to resonate with this person.",
"tools": ["smart_guidelines"],
"max_steps": 2
},
{
"prompt": "Write a personalized outreach email using the profile and angles identified. Under 150 words.\n<output name=\"subject\">string</output>\n<output name=\"body\">string</output>",
"tools": [],
"max_steps": 1
}
],
"outputs": ["subject", "body"]
}
Response:
{
"status": "completed",
"outputs": {
"subject": "Quick question about your Q2 pipeline, Sarah",
"body": "Hi Sarah, noticed Acme recently expanded into..."
},
"steps": [
{ "step": 1, "tool_calls": 1, "tokens": 420 },
{ "step": 2, "tool_calls": 1, "tokens": 310 },
{ "step": 3, "tool_calls": 0, "tokens": 215 }
]
}
Step 3 has no tools. It works with what steps 1 and 2 produced. Recall is guaranteed to run before the draft is written.
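The outputs field is presumably assembled by parsing the <output> tags the model emits in its final text. A rough sketch of that extraction under that assumption (this is illustrative, not the actual server code):

```typescript
// Sketch: extract named outputs from model text containing
// <output name="...">value</output> tags (assumed convention).
function parseOutputs(text: string): Record<string, string> {
  const outputs: Record<string, string> = {};
  const re = /<output name="([^"]+)">([\s\S]*?)<\/output>/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    outputs[m[1]] = m[2].trim(); // last occurrence of a name wins
  }
  return outputs;
}

const text =
  '<output name="subject">Quick question, Sarah</output>\n' +
  '<output name="body">Hi Sarah, noticed Acme...</output>';
const parsed = parseOutputs(text);
// parsed.subject → 'Quick question, Sarah'
```

Declaring the expected outputs in the step prompt, rather than asking for JSON, keeps the generation step free-form while still yielding machine-readable fields.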
Inbound lead qualification
Same structure, different task. A new lead submits a form: James Liu, VP Engineering at Horizon Robotics, 200-engineer team, looking for an AI memory layer.
The one-step version asks the model to recall prior history, check ICP criteria, score the lead, classify the tier, flag red flags, and draft a response email: all in one prompt. The model may draft the email before it has decided on the score. The tone of the email may not reflect the qualification.
The step version enforces the dependency:
{
"steps": [
{
"prompt": "Check if we have any prior contact or history with Horizon Robotics or jliu@horizonrobotics.com.",
"tools": ["recall_pro"],
"max_steps": 2
},
{
"prompt": "Fetch our ICP definition and qualification criteria. Score this lead 1-10 and classify as SMB / Mid-Market / Enterprise.\n<output name=\"score\">number</output>\n<output name=\"tier\">string</output>\n<output name=\"reason\">string</output>",
"tools": ["smart_guidelines"],
"max_steps": 2
},
{
"prompt": "Given the score, tier, and any prior history above: write a first-response email to James Liu. Warm if score >= 7, neutral if 4-6, polite decline if < 4. Under 120 words.\n<output name=\"subject\">string</output>\n<output name=\"body\">string</output>",
"tools": [],
"max_steps": 1
}
],
"outputs": ["score", "tier", "reason", "subject", "body"]
}
Response:
{
"outputs": {
"score": 9,
"tier": "Enterprise",
"reason": "200-person eng team, VP-level buyer, explicit use case match, no prior contact",
"subject": "Re: Your AI memory layer inquiry",
"body": "Hi James, thanks for reaching out. With 200 engineers and an internal tooling focus..."
}
}
The email tone is conditional on the score. The architecture enforces that ordering, not a comment in the prompt.
| | One-step prompt | Step-based |
|---|---|---|
| Tool scope | All tools, model decides | Scoped per step |
| Recall guaranteed | Maybe | Enforced by step 1 |
| Structured output | Parse free text | <output> tags |
| Per-step debugging | Not possible | Per-step token counts |
| Retry on failure | Restart everything | Retry one step |
The key insight: steps let you encode your task's dependency graph directly into the request.
Five properties we observed
1. Controlled complexity
A single prompt that says "research, find, draft, evaluate" gives the model a planning problem on top of the execution problem. It has to decide how to decompose the work, how much effort to allocate to each part, and when to move on.
With steps, the decomposition is done by the developer. The model never faces a planning problem. Each step is scoped: "do this one thing." The cognitive load per step is proportional to the complexity of that step, not the complexity of the entire workflow.
This matters because LLM reliability degrades with instruction complexity. A model that reliably follows a three-sentence instruction may fail unpredictably on a three-paragraph instruction. Decomposition keeps each instruction within the model's reliable operating range.
2. Tool scoping
Each step can specify which tools are available:
steps: [
{ prompt: 'Research the company', tools: ['web_search', 'recall_pro'] },
{ prompt: 'Draft the email' }, // no tools — pure generation
{ prompt: 'Evaluate compliance', tools: ['smart_guidelines'] }
]
Step 2 has no tools. The model can only generate text based on the context accumulated from step 1. It cannot make additional web searches, cannot look up more data, cannot call any external service. It works with what it has.
This is constraint as a feature. By removing tools from a generation step, you guarantee the model focuses on synthesis rather than additional data gathering. You prevent the common failure mode where the model, midway through drafting, decides it needs "just one more search" and derails the generation.
Tool scoping also reduces the attack surface for prompt injection. A tool-less step cannot be tricked into calling a dangerous function because no functions are available.
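The scoping mechanism itself is small. A sketch of how an orchestrator might narrow the tool set per step — function and field names here are illustrative, not the actual implementation:

```typescript
// Sketch: restrict the available tools to a step's declared scope.
type Tool = { description: string };

function scopeTools(
  allTools: Record<string, Tool>,
  allowed?: string[] // step.tools; undefined/empty means a tool-less step
): Record<string, Tool> {
  if (!allowed || allowed.length === 0) return {}; // pure generation
  return Object.fromEntries(
    Object.entries(allTools).filter(([name]) => allowed.includes(name))
  );
}

const all = {
  web_search: { description: 'Search the web' },
  recall_pro: { description: 'Recall stored memory' },
};
const scoped = scopeTools(all, ['web_search']);
// Only web_search survives; an unscoped step gets an empty object.
```

The security property falls out of the filtering: a tool the model cannot see is a tool the model cannot be tricked into calling.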
3. Forced pacing
The benefit we didn't expect was pacing. Steps force the LLM to slow down.
When a model receives "research, find, draft, evaluate" as a single instruction, it allocates its output budget across all four tasks. If the model's default output is 800 tokens, each sub-task might get 200 tokens of attention. The research is shallow. The evaluation is perfunctory.
With steps, each sub-task gets a full generation cycle. Step 1 can produce 800 tokens of research. Step 2 can produce 800 tokens of contact analysis. The model doesn't have to budget across tasks because it only sees one task at a time.
This is particularly important for evaluation steps. When self-evaluation is part of a larger instruction, models consistently under-invest in it. They generate the primary output and then append a brief "this looks good" assessment. When evaluation is its own step, with its own prompt, the model treats it as a first-class task. It actually evaluates.
4. Shared context without serialization
Multi-agent architectures solve the complexity problem by assigning each sub-task to a different agent. A research agent feeds into a drafting agent, which feeds into an evaluation agent. This works, but introduces a new problem: information loss at handoff boundaries.
Agent A's output has to be serialized, parsed, and injected into agent B's context. The rich intermediate state (which tool calls were made, what the raw results looked like, which search queries were tried and abandoned) is lost. Agent B sees a summary, not the full picture.
Step-based execution gives you multi-agent-like specialization without multi-agent information loss. Each step acts as a focused specialist. But all steps share the same context window. Step 3 can reference a specific line from step 1's web search results. Step 4 can evaluate whether step 3 actually incorporated the findings from step 1. The full history is available, uncompressed, unsummarized.
This is the fundamental architectural advantage: the behavior of multiple specialists with the context coherence of a single agent.
5. Observable intermediate state
Each step produces its own output, tool call log, and token usage:
{
"steps": [
{
"order": 1,
"text": "Acme Corp is a Series B startup building...",
"tool_calls": [
{ "name": "web_search", "args": { "query": "Acme Corp tech stack" } },
{ "name": "recall_pro", "args": { "record_id": "acme-corp" } }
],
"usage": { "prompt_tokens": 1200, "completion_tokens": 450 }
},
{
"order": 2,
"text": "The VP of Engineering is Sarah Chen...",
"tool_calls": [{ "name": "web_search", "args": { "query": "Sarah Chen Acme Corp LinkedIn" } }],
"usage": { "prompt_tokens": 1800, "completion_tokens": 380 }
}
]
}
When the final email is wrong, you can inspect the intermediate state. Was the company research accurate? Did the contact lookup find the right person? Did the draft actually use the research? Each step is independently verifiable.
This also enables targeted debugging. If step 2 consistently finds the wrong contact, you can adjust step 2's prompt without touching the rest of the workflow. In a monolithic prompt, tuning one sub-task often destabilizes others.
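Per-step telemetry also lends itself to automated checks. A sketch of one such check, flagging steps that were expected to call tools but didn't (the record shape follows the response above; the expectation map is an assumption you would define per workflow):

```typescript
// Sketch: flag steps whose tool-call count fell below expectation.
type StepRecord = { order: number; tool_calls: { name: string }[] };

function stepsMissingTools(
  steps: StepRecord[],
  minCalls: Record<number, number> // step order -> minimum tool calls
): number[] {
  return steps
    .filter((s) => (minCalls[s.order] ?? 0) > s.tool_calls.length)
    .map((s) => s.order);
}

const run: StepRecord[] = [
  { order: 1, tool_calls: [{ name: 'web_search' }, { name: 'recall_pro' }] },
  { order: 2, tool_calls: [] }, // contact lookup skipped its search
];
const flagged = stepsMissingTools(run, { 1: 1, 2: 1 });
// flagged → [2]: step 2 was expected to search but made no calls
```

In a monolithic prompt there is nothing comparable to assert against; here, "the recall step actually ran" becomes a one-line check in CI or monitoring.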
Client-executed tools and the step loop
Steps interact naturally with client-executed tools: tools that run on the developer's infrastructure, not ours.
When the LLM calls a client tool during step execution, the step pauses. The tool call is returned to the developer's SDK. The SDK executes the tool locally (a database lookup, a CRM query, an internal API call), sends the result back, and step execution resumes.
const result = await personize.responses.create({
steps: [
{ prompt: 'Look up this lead in our CRM and draft a follow-up' }
],
tools: {
crm_lookup: {
description: 'Find a contact in our CRM',
parameters: { type: 'object', properties: { email: { type: 'string' } } },
execute: async (args) => {
// This runs on YOUR server, not ours
return await db.contacts.findByEmail(args.email);
}
}
}
});
Under the hood, the SDK handles the loop: send request, receive requires_action with tool calls, execute locally, send results back, repeat until completion. The entire mechanism is stateless on the server side. No held connections, no WebSocket, no secondary endpoints. Just the same proven pattern that OpenAI established for function calling, extended to work across steps.
The key insight: the developer's compute handles tool execution, and Personize's compute handles reasoning and orchestration. You bring the tools, we bring the context.
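For the curious, here is roughly what the SDK is doing on your behalf — a sketch of the requires_action loop under assumed response shapes (the field names `status`, `tool_calls`, and the `send` callback are illustrative, based on the description above, not the published wire format):

```typescript
// Sketch: client-side requires_action loop for client-executed tools.
// Shapes and field names are assumptions based on the described pattern.
type ToolCall = { id: string; name: string; args: unknown };
type ApiResponse =
  | { status: 'completed'; outputs: Record<string, string> }
  | { status: 'requires_action'; tool_calls: ToolCall[] };
type Executor = (args: unknown) => Promise<unknown>;

async function runWithClientTools(
  send: (toolResults?: { id: string; result: unknown }[]) => Promise<ApiResponse>,
  executors: Record<string, Executor>
): Promise<Record<string, string>> {
  let response = await send();
  while (response.status === 'requires_action') {
    // Execute each requested tool on the client, then resume the step.
    const results = await Promise.all(
      response.tool_calls.map(async (call) => ({
        id: call.id,
        result: await executors[call.name](call.args),
      }))
    );
    response = await send(results);
  }
  return response.outputs;
}
```

Because every iteration is an ordinary stateless HTTP round trip, the loop survives retries and load balancing with no server-side session to lose.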
When steps don't help
Steps aren't always the answer. We still use flat messages[] for two cases:
Simple, single-task prompts. "Summarize this document" or "Translate this paragraph" don't benefit from decomposition. There's one task, one output, nothing to sequence. Adding steps to a single-task prompt is overhead without value.
Genuinely exploratory tasks. Some tasks benefit from the model's ability to plan its own approach: open-ended research where the model needs to follow unexpected threads, or creative brainstorming where imposed structure would constrain the output. These tasks need autonomy, not guidance.
For everything in between, where you know the general shape of the workflow but need the model to execute each phase thoroughly, steps give you control without losing capability.
The architecture
Steps are implemented as a sequential loop over generateText calls, sharing an accumulated message history:
for each step:
messages = [...accumulated_history, { role: 'user', content: step.prompt }]
result = generateText({ model, messages, tools: step.tools })
if result contains client tool call:
return requires_action (client executes, sends result back)
accumulated_history.push(step.prompt, result.text)
The orchestrator is a function, not a framework. It calls the AI SDK's generateText, accumulates results, and moves to the next step. No DAG engine, no state machine, no workflow definition language. Just a loop.
This simplicity is intentional. The orchestrator adds one thing that a raw LLM call doesn't: sequential structure with shared context. Everything else (tool execution, model selection, governance injection, memory recall) happens at layers that already exist.
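The pseudocode above, fleshed out as a runnable sketch. This is a simplified illustration, not the production orchestrator: `generateText` is mocked here, and the client-tool pause is omitted since it is covered separately above.

```typescript
// Sketch of the step orchestrator: a plain loop over model calls
// sharing one accumulated message history. `generateText` is a mock
// standing in for the AI SDK call of the same name.
type Message = { role: 'user' | 'assistant'; content: string };
type Step = { prompt: string; tools?: string[] };

function generateText(opts: { messages: Message[]; tools?: string[] }): { text: string } {
  // Mock: echoes the latest prompt; a real implementation calls the LLM.
  return { text: `[done: ${opts.messages[opts.messages.length - 1].content}]` };
}

function runSteps(steps: Step[]): { history: Message[]; finalText: string } {
  const history: Message[] = [];
  let finalText = '';
  for (const step of steps) {
    history.push({ role: 'user', content: step.prompt });
    // Each call sees the full accumulated history, scoped to this step's tools.
    const result = generateText({ messages: history, tools: step.tools });
    history.push({ role: 'assistant', content: result.text });
    finalText = result.text;
  }
  return { history, finalText };
}

const run = runSteps([
  { prompt: 'Research the company', tools: ['web_search'] },
  { prompt: 'Draft the email' }, // no tools: pure generation
]);
// run.history holds all prompts and results; run.finalText is the draft.
```

Everything a workflow engine would add — branching, persistence, scheduling — is deliberately absent; the loop is the whole orchestrator.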
The tradeoff
Steps trade LLM autonomy for developer control. The model can no longer decide its own workflow. It can't skip a step it considers unnecessary or add a step it thinks is missing. The developer has to get the decomposition right.
For us, this was the right tradeoff. In production, predictability matters more than adaptability. We want the model to reliably execute the workflow we designed, not creatively reinterpret it.
If you need the model to adapt, you can build that into individual steps. A step that says "evaluate the draft; if it scores below 7, revise and re-evaluate" gives the model local autonomy within a bounded scope. The step is a sandbox for creativity, not an open field.
What we shipped
The Responses API (POST /api/v1/responses) accepts a steps[] array. Each step has a prompt, optional tool scope, and optional max step count for tool loops within the step.
For developers who don't need step-based orchestration, the endpoint also accepts messages[] for a simpler, single-step experience. And POST /api/v1/chat/completions provides a standard OpenAI-compatible interface for drop-in migration.
Both endpoints support Personize's governance (SmartGuidelines), memory (recall and memorize), and BYOK (bring your own LLM key for direct provider access to OpenAI, Anthropic, Google, and others).
The step orchestrator, the tool splitter, and the client tool execution loop are all open components. The SDK handles the requires_action loop automatically, so from the developer's perspective, client-executed tools look like local function calls.
None of this is a new idea. Structured prompting, prompt chaining, and orchestration frameworks all explore aspects of it. What worked for us was the specific combination: sequential steps with shared context, scoped tool access, client-executed tools, and the observation that steps produce multi-agent behavior without multi-agent complexity. messages[] made sense when LLMs were chatbots. For agents doing repeatable work, we found that the instruction is the right unit of composition.