A single prompt gives the LLM too much freedom. Decomposing instructions into sequential steps creates guided agency: multi-agent-like behavior from a single context window, with controllable complexity and shared state.
TL;DR
- The dominant pattern for AI agent APIs is messages[]: a flat conversation history where everything the agent needs is packed into one system prompt or user message. This works for chatbots. It fails for workflows.
- Instruction decomposition replaces a single complex prompt with an ordered sequence of steps, each with its own prompt, tool scope, and success criteria. The LLM executes steps sequentially, accumulating context as it goes.
- This produces three benefits: complexity becomes manageable (each step is a bounded task), the LLM is forced to slow down and work methodically, and the developer gets fine-grained control over which tools are available at which stage.
- The result is multi-agent-like behavior from a single context window. Each step acts as a focused specialist, but all steps share the same accumulated state. No serialization, no handoff protocols, no conflicting memories.
- We shipped this as POST /api/v1/responses with a steps[] parameter. This article explains why.
The problem with a single instruction
Consider a common agentic task: research a company, find the right contact, draft a personalized outreach email, and self-evaluate the draft against brand guidelines.
The message-based approach packs everything into one prompt:
You are a sales research agent. Research {company}, find the VP of
Engineering, look up their recent activity, draft a personalized
cold email following our brand voice guidelines, and then evaluate
whether the email meets our compliance rules. Output the email
subject, body, and a compliance score.
This works sometimes. When it fails, it fails in characteristic ways.
The LLM rushes. Given a complex instruction, models tend to satisfy each sub-task minimally rather than thoroughly. The research step gets a paragraph. The contact lookup gets skipped if the model can infer a plausible name. The self-evaluation is a rubber stamp. The model is optimizing for completion, not quality at each stage.
Tool usage is unpredictable. When the model has access to ten tools and a five-part instruction, it decides which tools to use, in what order, and how many times. Sometimes it calls web_search three times to build a thorough picture. Sometimes it calls it once and moves on. The developer has no control over the depth of each phase.
Debugging is opaque. When the output is wrong, you can't tell where the process broke down. Was the research bad? Was the contact lookup wrong? Did the model ignore the brand guidelines? The entire workflow is a single LLM call with a single output. There's no intermediate state to inspect.
The root cause is architectural: a single instruction gives the LLM a single opportunity to plan, and it will always choose the shortest path through the plan.

Instruction decomposition
The alternative is to decompose the instruction into discrete steps and execute them sequentially, each step building on the accumulated context of prior steps.
const result = await personize.responses.create({
steps: [
{
prompt: 'Research {{company}}. Find their tech stack, recent funding, and key technical challenges.',
tools: ['web_search', 'recall_pro']
},
{
prompt: 'Find the VP of Engineering or equivalent technical leader. Look up their recent LinkedIn activity and publications.',
tools: ['web_search']
},
{
prompt: 'Draft a personalized cold email referencing specific details from your research. Follow the brand voice guidelines.',
},
{
prompt: 'Evaluate the draft against compliance rules. Score 1-10. If below 7, revise and re-score.',
}
],
personize: {
governance: { guideline_ids: ['brand-voice', 'outreach-compliance'] },
memory: { record_id: 'company-123', recall: true }
}
});

Each step is a bounded task. The LLM receives one instruction, executes it, produces output, and that output becomes part of the context for the next step. The developer controls the sequence, the tool availability at each stage, and the granularity of each task.
This is not prompt chaining. Prompt chaining typically involves separate LLM calls with explicit input/output parsing between them. Instruction decomposition maintains a single, shared context window. Step 3 doesn't receive a serialized summary of steps 1 and 2. It sees the full conversation history: the original prompts, the tool calls, the raw results, the generated text. Nothing is lost in translation.
Five properties of step-based execution
1. Controlled complexity
A single prompt that says "research, find, draft, evaluate" gives the model a planning problem on top of the execution problem. It has to decide how to decompose the work, how much effort to allocate to each part, and when to move on.
With steps, the decomposition is done by the developer. The model never faces a planning problem. Each step is scoped: "do this one thing." The cognitive load per step is proportional to the complexity of that step, not the complexity of the entire workflow.
This matters because LLM reliability degrades with instruction complexity. A model that reliably follows a three-sentence instruction may fail unpredictably on a three-paragraph instruction. Decomposition keeps each instruction within the model's reliable operating range.
2. Tool scoping
Each step can specify which tools are available:
steps: [
{ prompt: 'Research the company', tools: ['web_search', 'recall_pro'] },
{ prompt: 'Draft the email' }, // no tools — pure generation
{ prompt: 'Evaluate compliance', tools: ['smart_guidelines'] }
]

Step 2 has no tools. The model can only generate text based on the context accumulated from step 1. It cannot make additional web searches, cannot look up more data, cannot call any external service. It works with what it has.
This is constraint as a feature. By removing tools from a generation step, you guarantee the model focuses on synthesis rather than additional data gathering. You prevent the common failure mode where the model, midway through drafting, decides it needs "just one more search" and derails the generation.
Tool scoping also reduces the attack surface for prompt injection. A tool-less step cannot be tricked into calling a dangerous function because no functions are available.
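As a sketch of how per-step scoping might be resolved (the registry contents and step shape here are illustrative assumptions, not Personize's actual internals), a step with no tools field resolves to an empty tool set:

```javascript
// Illustrative sketch: resolve a step's tool scope against a full registry.
// Registry contents and step shape are assumptions for this example.
const registry = {
  web_search: { description: 'Search the web' },
  recall_pro: { description: 'Recall stored memory' },
  smart_guidelines: { description: 'Check governance guidelines' },
};

function toolsForStep(step) {
  if (!step.tools) return {}; // tool-less step: pure generation, nothing callable
  return Object.fromEntries(step.tools.map((name) => [name, registry[name]]));
}

console.log(Object.keys(toolsForStep({ prompt: 'Research the company', tools: ['web_search', 'recall_pro'] })));
// [ 'web_search', 'recall_pro' ]
console.log(Object.keys(toolsForStep({ prompt: 'Draft the email' })));
// []
```

Because the drafting step's scope is empty before the model ever runs, no prompt content can conjure a callable function into it.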
3. Forced pacing
The most underappreciated benefit of step-based execution is pacing. Steps force the LLM to slow down.
When a model receives "research, find, draft, evaluate" as a single instruction, it allocates its output budget across all four tasks. If the model's default output is 800 tokens, each sub-task might get 200 tokens of attention. The research is shallow. The evaluation is perfunctory.
With steps, each sub-task gets a full generation cycle. Step 1 can produce 800 tokens of research. Step 2 can produce 800 tokens of contact analysis. The model doesn't have to budget across tasks because it only sees one task at a time.
This is particularly important for evaluation steps. When self-evaluation is part of a larger instruction, models consistently under-invest in it. They generate the primary output and then append a brief "this looks good" assessment. When evaluation is its own step, with its own prompt, the model treats it as a first-class task. It actually evaluates.
4. Shared context without serialization
Multi-agent architectures solve the complexity problem by assigning each sub-task to a different agent. A research agent feeds into a drafting agent, which feeds into an evaluation agent. This works, but introduces a new problem: information loss at handoff boundaries.
Agent A's output has to be serialized, parsed, and injected into agent B's context. The rich intermediate state (which tool calls were made, what the raw results looked like, which search queries were tried and abandoned) is lost. Agent B sees a summary, not the full picture.
Step-based execution gives you multi-agent-like specialization without multi-agent information loss. Each step acts as a focused specialist. But all steps share the same context window. Step 3 can reference a specific line from step 1's web search results. Step 4 can evaluate whether step 3 actually incorporated the findings from step 1. The full history is available, uncompressed, unsummarized.
This is the fundamental architectural advantage: the behavior of multiple specialists with the context coherence of a single agent.
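The shared-context property can be sketched as one history array that every step appends to, raw tool results included (the message shapes here are simplified assumptions):

```javascript
// Sketch of shared context: every step appends to one history, so a later
// step sees earlier steps' raw tool results verbatim, never a summary.
const history = [];

function recordStep(prompt, toolResults, text) {
  history.push({ role: 'user', content: prompt });
  for (const r of toolResults) {
    history.push({ role: 'tool', name: r.name, content: r.raw }); // raw, not serialized
  }
  history.push({ role: 'assistant', content: text });
}

recordStep(
  'Research Acme Corp',
  [{ name: 'web_search', raw: 'Acme raised a $30M Series B...' }],
  'Acme is a Series B startup...'
);
recordStep('Find the VP of Engineering', [], 'The VP of Engineering is Sarah Chen.');

// A later step's context is simply the full history plus its own prompt:
const step3Context = [...history, { role: 'user', content: 'Draft the email' }];
console.log(step3Context.length); // 6
```

There is no handoff boundary at which anything could be compressed away; the only thing that grows is the array.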
5. Observable intermediate state
Each step produces its own output, tool call log, and token usage:
{
"steps": [
{
"order": 1,
"text": "Acme Corp is a Series B startup building...",
"tool_calls": [
{ "name": "web_search", "args": { "query": "Acme Corp tech stack" } },
{ "name": "recall_pro", "args": { "record_id": "acme-corp" } }
],
"usage": { "prompt_tokens": 1200, "completion_tokens": 450 }
},
{
"order": 2,
"text": "The VP of Engineering is Sarah Chen...",
"tool_calls": [{ "name": "web_search", "args": { "query": "Sarah Chen Acme Corp LinkedIn" } }],
"usage": { "prompt_tokens": 1800, "completion_tokens": 380 }
}
]
}

When the final email is wrong, you can inspect the intermediate state. Was the company research accurate? Did the contact lookup find the right person? Did the draft actually use the research? Each step is independently verifiable.
This also enables targeted debugging. If step 2 consistently finds the wrong contact, you can adjust step 2's prompt without touching the rest of the workflow. In a monolithic prompt, tuning one sub-task often destabilizes others.
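Because each step reports its own tool calls and usage, inspection can be programmatic. A minimal sketch against the response shape shown above (the data is illustrative):

```javascript
// Sketch: flag steps that made no tool calls and sum per-step token usage.
// The response data mirrors the shape shown above; values are illustrative.
const response = {
  steps: [
    {
      order: 1,
      text: 'Acme Corp is a Series B startup building...',
      tool_calls: [{ name: 'web_search' }, { name: 'recall_pro' }],
      usage: { prompt_tokens: 1200, completion_tokens: 450 },
    },
    {
      order: 2,
      text: 'The VP of Engineering is Sarah Chen...',
      tool_calls: [],
      usage: { prompt_tokens: 1800, completion_tokens: 380 },
    },
  ],
};

const silentSteps = response.steps.filter((s) => s.tool_calls.length === 0).map((s) => s.order);
const totalTokens = response.steps.reduce(
  (n, s) => n + s.usage.prompt_tokens + s.usage.completion_tokens, 0);

console.log(silentSteps); // [ 2 ]
console.log(totalTokens); // 3830
```

A step that was expected to search but made zero tool calls is exactly the kind of signal a monolithic prompt never surfaces.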
Client-executed tools and the step loop
Steps interact naturally with client-executed tools: tools that run on the developer's infrastructure, not ours.
When the LLM calls a client tool during step execution, the step pauses. The tool call is returned to the developer's SDK. The SDK executes the tool locally (a database lookup, a CRM query, an internal API call), sends the result back, and step execution resumes.
const result = await personize.responses.create({
steps: [
{ prompt: 'Look up this lead in our CRM and draft a follow-up' }
],
tools: {
crm_lookup: {
description: 'Find a contact in our CRM',
parameters: { type: 'object', properties: { email: { type: 'string' } } },
execute: async (args) => {
// This runs on YOUR server, not ours
return await db.contacts.findByEmail(args.email);
}
}
}
});

Under the hood, the SDK handles the loop: send request, receive requires_action with tool calls, execute locally, send results back, repeat until completion. The entire mechanism is stateless on the server side. No held connections, no WebSocket, no secondary endpoints. Just the same proven pattern that OpenAI established for function calling, extended to work across steps.
The key insight: the developer's compute handles tool execution, and Personize's compute handles reasoning and orchestration. You bring the tools, we bring the context.
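The loop the SDK runs could be sketched roughly as follows. The send function, response statuses, and field names here are assumptions for illustration, not the real wire protocol; the stub server stands in for the API:

```javascript
// Sketch of the requires_action loop: pause on tool calls, execute locally,
// send results back, repeat until the step completes. Shapes are illustrative.
async function runWithClientTools(send, request, tools) {
  let response = await send(request);
  while (response.status === 'requires_action') {
    const outputs = await Promise.all(
      response.tool_calls.map(async (call) => ({
        id: call.id,
        result: await tools[call.name].execute(call.args), // runs on YOUR server
      }))
    );
    response = await send({ ...request, tool_outputs: outputs }); // server resumes statelessly
  }
  return response;
}

// Stub server: first call pauses for a tool, second call completes.
let calls = 0;
const send = async (req) =>
  ++calls === 1
    ? { status: 'requires_action', tool_calls: [{ id: 't1', name: 'crm_lookup', args: { email: 'a@b.co' } }] }
    : { status: 'completed', text: `Drafted follow-up using ${req.tool_outputs[0].result.name}` };

const tools = { crm_lookup: { execute: async (args) => ({ name: 'Sarah Chen', email: args.email }) } };

runWithClientTools(send, { steps: [{ prompt: 'Look up this lead and draft a follow-up' }] }, tools)
  .then((r) => console.log(r.text)); // "Drafted follow-up using Sarah Chen"
```

Note that the loop holds no connection between requests; each send carries everything the server needs to resume.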
When steps don't help
Steps are not universally better than a single instruction. Two cases where a flat message-based approach is sufficient:
Simple, single-task prompts. "Summarize this document" or "Translate this paragraph" don't benefit from decomposition. There's one task, one output, nothing to sequence. Adding steps to a single-task prompt is overhead without value.
Genuinely exploratory tasks. Some tasks benefit from the model's ability to plan its own approach. Open-ended research where the model needs to follow unexpected threads, creative brainstorming where structure constrains the output. These tasks need autonomy, not guidance.
For everything in between, where you know the general shape of the workflow but need the model to execute each phase thoroughly, steps give you control without losing capability.
The architecture
Steps are implemented as a sequential loop over generateText calls, sharing an accumulated message history:
for each step:
messages = [...accumulated_history, { role: 'user', content: step.prompt }]
result = generateText({ model, messages, tools: step.tools })
if result contains client tool call:
return requires_action (client executes, sends result back)
accumulated_history.push(step.prompt, result.text)
The orchestrator is a function, not a framework. It calls the AI SDK's generateText, accumulates results, and moves to the next step. No DAG engine, no state machine, no workflow definition language. Just a loop.
This simplicity is intentional. The orchestrator adds one thing that a raw LLM call doesn't: sequential structure with shared context. Everything else (tool execution, model selection, governance injection, memory recall) happens at layers that already exist.
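The loop above, rendered as a runnable sketch with generateText stubbed (the real orchestrator calls the AI SDK; client-tool pausing is omitted here for brevity):

```javascript
// Simplified rendering of the step loop: one accumulated history,
// one generateText call per step. No DAG, no state machine.
async function runSteps(steps, generateText) {
  const history = [];
  for (const step of steps) {
    const messages = [...history, { role: 'user', content: step.prompt }];
    const text = await generateText(messages, step.tools);
    history.push({ role: 'user', content: step.prompt }, { role: 'assistant', content: text });
  }
  return history;
}

// Stubbed model: reports how much context it saw at each step.
runSteps(
  [{ prompt: 'research' }, { prompt: 'draft' }],
  async (messages) => `saw ${messages.length} messages`,
).then((h) => console.log(h[3].content)); // "saw 3 messages"
```

The second step sees three messages (step 1's prompt, step 1's output, its own prompt), which is the shared-context accumulation in miniature.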
The tradeoff
Steps trade LLM autonomy for developer control. The model can no longer decide its own workflow. It can't skip a step it considers unnecessary or add a step it thinks is missing. The developer has to get the decomposition right.
This is the correct tradeoff for production systems. In production, predictability matters more than adaptability. You want the model to reliably execute the workflow you designed, not creatively reinterpret it.
If you need the model to adapt, you can build that into individual steps. A step that says "evaluate the draft; if it scores below 7, revise and re-evaluate" gives the model local autonomy within a bounded scope. The step is a sandbox for creativity, not an open field.
What we shipped
The Responses API (POST /api/v1/responses) accepts a steps[] array. Each step has a prompt, an optional tool scope, and an optional cap on tool-loop iterations within the step.
For developers who don't need step-based orchestration, the endpoint also accepts messages[] for a simpler, single-step experience. And POST /api/v1/chat/completions provides a standard OpenAI-compatible interface for drop-in migration.
Both endpoints support Personize's governance (SmartGuidelines), memory (recall and memorize), and BYOK (bring your own LLM key for direct provider access to OpenAI, Anthropic, Google, and others).
The step orchestrator, the tool splitter, and the client tool execution loop are all open components. The SDK handles the requires_action loop automatically, so from the developer's perspective, client-executed tools look like local function calls.
Instruction decomposition is not a new idea. Structured prompting, prompt chaining, and orchestration frameworks all explore aspects of it. What's different here is the combination: sequential steps with shared context, scoped tool access, client-executed tools, and the observation that steps produce multi-agent behavior without multi-agent complexity. The messages paradigm made sense when LLMs were chatbots. For agents doing work, the instruction is the right unit of composition.