What happens when you strip an AI API down to reasoning only? The tools run on the developer's servers. The state lives in signed payloads, not databases. Governance becomes infrastructure. And evaluation becomes a built-in second opinion.
TL;DR
- Most AI APIs are monoliths: they run the model, execute the tools, store the conversation state, and manage the guardrails. Each responsibility adds infrastructure cost, latency, and attack surface.
- We built an AI API that only does reasoning. Tool execution happens on the developer's servers. Conversation state is carried in HMAC-signed payloads (no server-side storage). Governance is an addressable parameter, not a hardcoded prompt. Evaluation is a toggleable second LLM call that scores every response.
- This decomposition produces four properties: (1) sensitive data never leaves the developer's infrastructure, (2) the server is stateless and horizontally scalable, (3) organizational rules are versioned objects that update once and propagate everywhere, (4) quality assurance is built into the API, not bolted on after.
- The design challenges the assumption that AI APIs need to be thick middleware. A thin reasoning layer that owns nothing but the thinking turns out to be more useful, more secure, and cheaper to operate.
The monolith problem
A typical AI API call involves four distinct responsibilities:
- Reasoning: the LLM processes context and generates a response
- Tool execution: the LLM calls functions (web search, database queries, API calls) and incorporates the results
- State management: the conversation history is stored between requests so multi-turn interactions work
- Governance: safety rules, brand guidelines, and compliance policies are enforced
Most AI APIs handle all four. The provider runs the model, hosts the tool execution environment, maintains conversation sessions, and applies content filters. This is convenient. It is also architecturally expensive.
Every tool the provider executes runs on the provider's compute. Every conversation session consumes the provider's storage. Every governance rule the provider enforces is one the developer cannot customize without the provider's cooperation. The provider becomes a thick middleware layer between the developer and their users, handling responsibilities that the developer may not want to delegate.
The question we asked: what if the AI API only did the reasoning?

Part 1: The compute split
When the LLM decides to call a tool, that decision is the reasoning. The actual execution of the tool (querying a database, calling a CRM, hitting an internal microservice) is work. These are different things, and they don't need to happen on the same server.
In our architecture, the LLM reasons on our infrastructure. When it decides a tool call is needed, the call is returned to the developer's code. The developer's code executes the tool on their own servers, sends the result back, and reasoning continues.
```typescript
const result = await personize.responses.create({
  steps: [
    { prompt: 'Look up this lead and draft a personalized follow-up' }
  ],
  tools: {
    crm_lookup: {
      description: 'Find a contact in our CRM by email',
      parameters: { type: 'object', properties: { email: { type: 'string' } } },
      execute: async (args) => {
        // This runs on the developer's server
        const contact = await db.contacts.findByEmail(args.email);
        return { name: contact.name, company: contact.company, lastActivity: contact.lastActivity };
      }
    }
  }
});
```

The execute function never leaves the developer's process. The SDK sends only the tool schema (name, description, parameters) to our API. When the LLM invokes the tool, our API returns a requires_action response. The SDK intercepts it, runs execute locally, and sends the result back as a new request. Our API resumes reasoning.
Three things follow from this split.
Sensitive data stays home. The CRM query runs on the developer's database. Customer records, deal values, contact history: none of it transits our servers. We see the tool schema ("this tool looks up contacts by email") and the result the developer chooses to send back. We never see the query, the connection string, or the full database response.
Tool compute is the developer's cost, not ours. A tool that runs for 30 seconds against a slow internal API doesn't consume our resources. We're idle while the developer's code executes. This changes the economics: we charge for reasoning (tokens and orchestration time), not for the unbounded cost of arbitrary tool workloads.
Tool availability is unlimited. The developer can register any function as a tool. Internal APIs, private databases, on-premise systems, proprietary services. There is no marketplace, no approval process, no integration to build on our side. If the developer can call it from their code, the LLM can use it.
How the loop works
The protocol is HTTP, not WebSocket. No held connections, no server-side event queues. Each round of the tool loop is a complete HTTP request/response cycle.
```
Round 1:
  Client → POST /api/v1/responses { steps, tools: [schema] }
  Server → 200 { status: "requires_action", tool_calls: [...], conversation: [...] }
  Client executes tool locally

Round 2:
  Client → POST /api/v1/responses { conversation: [...], tool_results: [...] }
  Server → 200 { status: "completed", text: "..." }
```
The SDK abstracts this into a single await call. The developer defines tools with execute functions and gets back a completed response. The loop is invisible unless they want to see it.
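Stripped of the SDK, the loop logic can be sketched like this. The post function stands in for the HTTP transport, and the payload field names are illustrative, not the exact wire format:

```typescript
// Sketch of the client-side tool loop the SDK runs. `post` stands in for
// the HTTP transport; payload shapes are illustrative.
type ToolCall = { id: string; function: { name: string; arguments: string } };
type ApiResponse =
  | {
      status: 'requires_action';
      tool_calls: ToolCall[];
      conversation: unknown[];
      conversation_signature: string;
    }
  | { status: 'completed'; text: string };

type LocalTools = Record<string, (args: any) => Promise<unknown>>;

async function runToolLoop(
  post: (body: object) => Promise<ApiResponse>,
  initialBody: object,
  tools: LocalTools,
): Promise<string> {
  let response = await post(initialBody);
  // Keep looping as long as the server hands execution back to the client.
  while (response.status === 'requires_action') {
    const results = await Promise.all(
      response.tool_calls.map(async (call) => ({
        tool_call_id: call.id,
        output: await tools[call.function.name](JSON.parse(call.function.arguments)),
      })),
    );
    // Next round: return the signed conversation plus the tool results.
    response = await post({
      conversation: response.conversation,
      conversation_signature: response.conversation_signature,
      tool_results: results,
    });
  }
  return response.text;
}
```

Each iteration of the while loop is one complete HTTP round trip; the server holds nothing between rounds.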
Part 2: Stateless conversation continuity
The compute split creates a problem. Multi-turn tool loops require the server to know where it left off. Which step was executing? What was the conversation history? What tool calls are pending?
The standard solution is server-side sessions. Store conversation state in a database, return a session ID, look it up on the next request. This works, but it introduces storage costs, TTL management, cache invalidation, and a scaling bottleneck. Every request hits the session store before reasoning can begin.
We went a different direction: the client carries the state.
When the server returns requires_action, the response includes the full conversation history and an HMAC-SHA256 signature:
```json
{
  "status": "requires_action",
  "required_action": {
    "tool_calls": [{ "id": "tc_1", "function": { "name": "crm_lookup", "arguments": "{\"email\":\"j@acme.com\"}" } }]
  },
  "conversation": [
    { "role": "user", "content": "Look up this lead..." },
    { "role": "assistant", "content": "", "tool_calls": [{ "name": "crm_lookup", "args": {"email": "j@acme.com"} }] }
  ],
  "conversation_signature": "hmac_sha256_abc123..."
}
```

On the next request, the client sends the conversation and signature back. The server verifies the HMAC before processing. If the conversation was modified (messages injected, history rewritten, tool calls altered), the signature check fails and the request is rejected.
This is JWT for conversation state. The server stores nothing. The client carries the full state. The signature guarantees integrity. The server is stateless and horizontally scalable. Add more instances, put a load balancer in front, and any instance can handle any request. No sticky sessions, no shared state store, no Redis.
What the signature covers
The HMAC is computed over the conversation content and a request fingerprint that includes the original step definitions and tool schemas. This prevents a subtle attack: a client cannot take a conversation from one request context and replay it against a different set of steps or tools. The signature binds the conversation to the original request parameters.
```typescript
// Server-side verification
const fingerprint = createRequestFingerprint(steps, tools);
const valid = verifyConversation(conversation, signature, conversation.length, fingerprint);
if (!valid) {
  return { error: 'Invalid conversation signature' };
}
```

The developer never needs to think about this. The SDK handles signing and verification transparently. But the security property is real: our server processes exactly the conversation it generated, unmodified.
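A minimal sketch of the signing scheme, assuming canonical JSON serialization and simplified function signatures (the production implementation may differ in both):

```typescript
import { createHash, createHmac, timingSafeEqual } from 'node:crypto';

// Fingerprint of the original request parameters, so a conversation can't be
// replayed against a different set of steps or tools.
function createRequestFingerprint(steps: object[], tools: object): string {
  return createHash('sha256').update(JSON.stringify({ steps, tools })).digest('hex');
}

// HMAC-SHA256 over the serialized conversation plus the request fingerprint.
function signConversation(secret: string, conversation: object[], fingerprint: string): string {
  return createHmac('sha256', secret)
    .update(JSON.stringify(conversation))
    .update(fingerprint)
    .digest('hex');
}

function verifyConversation(secret: string, conversation: object[], fingerprint: string, signature: string): boolean {
  const expected = Buffer.from(signConversation(secret, conversation, fingerprint), 'hex');
  const provided = Buffer.from(signature, 'hex');
  // Constant-time comparison; timingSafeEqual throws on length mismatch.
  return expected.length === provided.length && timingSafeEqual(expected, provided);
}
```

Tampering with any message, and equally replaying the conversation under a different fingerprint, changes the expected HMAC and fails verification.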
Part 3: Governance as infrastructure
Most AI APIs treat governance as a system prompt. The developer writes safety instructions, brand voice guidelines, and compliance rules into a system message, and hopes the model follows them. When the rules change, the developer updates the prompt in every place it's used. When different teams need different rules, they maintain different system prompts.
This is governance as copy-paste. It doesn't scale.
In our API, governance is an addressable parameter:
```typescript
const result = await personize.responses.create({
  steps: [{ prompt: 'Draft a cold outreach email' }],
  personize: {
    governance: {
      guideline_ids: ['brand-voice', 'outreach-compliance', 'gdpr-rules']
    }
  }
});
```

brand-voice, outreach-compliance, and gdpr-rules are not strings. They are versioned governance objects stored in your organization's account. Each one is a structured document (with sections, priorities, and scoping rules) managed through the dashboard or API.
When the request arrives, the governance layer retrieves the specified guidelines, selects the relevant sections based on the task context (using SmartGuidelines routing), and injects them into the model's context. The developer never sees the injected text. They reference guidelines by ID and the system handles delivery.
Three properties follow.
Single source of truth. Update a guideline once. Every API call that references it picks up the change. No redeployment, no prompt engineering across fifty endpoints, no coordination between teams.
Session-aware deduplication. Within a session (a series of related requests sharing a session ID), the governance layer tracks which guidelines have already been delivered. Step 3 of a five-step workflow doesn't re-inject the same brand voice guidelines that were delivered in step 1. This is progressive context delivery: each step gets only the governance content that's new or newly relevant.
The governance layer knows what the model has already seen and doesn't repeat itself. This alone reduces token usage by 50% in multi-step workflows.

Scoped access. Different teams, different products, and different use cases can reference different guideline sets. The sales team's outreach pipeline references brand-voice and outreach-compliance. The support team's ticket responder references brand-voice and support-escalation-rules. The brand voice is shared. The domain-specific rules are scoped. No duplication, no divergence.
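The deduplication bookkeeping can be sketched as a per-session set of delivered guideline IDs (illustrative; the production store and scoping logic may differ):

```typescript
// Per-session record of which guideline IDs have already been injected.
// A Map is illustrative; production would back this with a shared cache.
const deliveredBySession = new Map<string, Set<string>>();

function guidelinesToInject(sessionId: string, requested: string[]): string[] {
  const delivered = deliveredBySession.get(sessionId) ?? new Set<string>();
  // Only guidelines the session hasn't seen yet get injected.
  const fresh = requested.filter((id) => !delivered.has(id));
  fresh.forEach((id) => delivered.add(id));
  deliveredBySession.set(sessionId, delivered);
  return fresh;
}
```

Step 1 of a workflow requesting brand-voice and outreach-compliance delivers both; a later step requesting brand-voice and gdpr-rules delivers only gdpr-rules.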
Governance is not a filter
This is a distinction worth making explicit. Content filters (the kind that reject responses containing certain keywords or topics) are post-hoc. They run after the model generates output and decide whether to block it.
Governance as we implement it is pre-hoc. The guidelines are injected into the model's context before generation. The model generates within the constraints, rather than generating freely and being filtered after the fact. This produces output that follows the guidelines naturally, rather than output that avoids triggering a filter.
The difference matters in practice. A filtered model that's told "don't mention competitors" might write an awkward response that dances around the topic. A governed model that has the actual competitive positioning guidelines in context will handle the topic correctly, because it knows what to say instead of what not to say.
Part 4: Built-in evaluation
The last piece is evaluation. When evaluate: true is set on a request, the API runs a second LLM call after the primary generation. This second call scores the response against configurable criteria using structured output (a Zod schema that enforces numeric scores, per-criterion breakdowns, and explanations).
```typescript
const result = await personize.responses.create({
  steps: [{ prompt: 'Draft a cold outreach email for {{company}}' }],
  personize: {
    governance: { guideline_ids: ['brand-voice'] }
  },
  evaluate: true,
  evaluation_criteria: 'brand-voice-adherence, personalization-depth, call-to-action-clarity'
});

// result.evaluation:
// {
//   finalScore: 82,
//   criteriaScores: [
//     { name: 'brand-voice-adherence', score: 9, maxScore: 10, reason: '...' },
//     { name: 'personalization-depth', score: 7, maxScore: 10, reason: '...' },
//     { name: 'call-to-action-clarity', score: 8, maxScore: 10, reason: '...' }
//   ],
//   explanation: 'The email follows brand voice guidelines closely...'
// }
```

This is not the model evaluating itself. The primary generation uses whatever model the developer selected (or whatever model the tier defaults to). The evaluation uses a separate, fast model on a separate call. It sees the original prompt, the generated response, the tool calls that were made, the extracted outputs, and the evaluation criteria. It scores each criterion independently and provides reasoning.
Why built-in evaluation matters
Most teams bolt evaluation on after building the pipeline. They write test harnesses, sample outputs, and manually review quality. This works during development. It breaks in production, where you need continuous quality monitoring across thousands of requests.
Built-in evaluation makes quality a toggle. Turn it on for every request and you have a continuous quality signal. Turn it on for a random 10% sample and you have statistical monitoring. Turn it on with custom criteria per use case and you have domain-specific quality assurance.
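A 10% sample can be implemented deterministically on the client side, so the same requests are always in (or out of) the evaluated sample. A sketch, where requestId and the hashing scheme are assumptions:

```typescript
import { createHash } from 'node:crypto';

// Deterministic sampling: hash the request ID and map it to [0, 1), so a
// given request always gets the same evaluate decision.
function shouldEvaluate(requestId: string, sampleRate: number): boolean {
  const digest = createHash('sha256').update(requestId).digest();
  const fraction = digest.readUInt32BE(0) / 2 ** 32;
  return fraction < sampleRate;
}
```

Setting evaluate: shouldEvaluate(requestId, 0.10) per request yields a stable 10% sample, which makes the resulting scores comparable across deployments and over time.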
The evaluation scores are stored in DynamoDB and accessible through the API. Over time, you build a dataset of scored outputs that shows how quality trends across models, prompts, and governance configurations. This is the data you need to make informed decisions about prompt changes, model upgrades, and guideline revisions.
Evaluation is not a feature. It's the feedback loop that makes everything else improvable.

The evaluation is honest
Because the evaluation model is separate from the generation model, it doesn't have the self-serving bias that plagues self-evaluation. A model asked to evaluate its own output consistently rates itself higher than an independent evaluator does. Our evaluation is structurally independent: different model, different call, no shared state. It's a second opinion, not a self-assessment.
The thin layer thesis
These four pieces (client-side execution, stateless continuity, addressable governance, built-in evaluation) add up to an architectural thesis: an AI API should be a thin reasoning layer, not a thick middleware platform.
The reasoning layer owns the LLM call, the step orchestration, the governance injection, and the evaluation. It does not own the tools, the conversation state, the data, or the infrastructure those tools run on.
This changes three things about how the API operates.
Security model. The developer's sensitive data (customer records, internal metrics, proprietary content) never transits the reasoning layer. The LLM sees tool schemas and tool results, not raw database access. The developer controls exactly what information crosses the boundary.
Scaling model. The reasoning layer is stateless. No session store, no connection pool for tool execution, no cleanup jobs for expired conversations. Horizontal scaling is a matter of adding instances behind a load balancer. The compute-intensive work (tool execution) is distributed across the developer's own infrastructure.
Economic model. The developer pays for reasoning (tokens and orchestration time). They don't pay us for tool compute, data storage, or infrastructure that exists to support responsibilities we shouldn't own. BYOK (bring your own LLM key) takes this further: the developer can use their own API keys for OpenAI, Anthropic, Google, or any supported provider. In that mode, we charge a flat platform fee for orchestration, governance, and evaluation. We're infrastructure, not a reseller.
What we shipped
POST /api/v1/responses accepts a steps[] array for orchestrated execution or messages[] for simple use cases. Client tools are defined as JSON schemas with local execute functions. Governance is referenced by guideline IDs. Evaluation is a boolean toggle with optional custom criteria.
POST /api/v1/chat/completions provides an OpenAI-compatible interface for developers who want the governance and evaluation benefits without changing their request format.
The SDK (TypeScript and Python) handles the tool execution loop, conversation signing, and response parsing. From the developer's perspective, a request with client-executed tools, governance, and evaluation looks like a single await call.
The assumption behind most AI API design is that the provider should do more. More tool hosting, more state management, more middleware. We went the other direction: the provider should do less, and do it well. Reasoning, governance, evaluation. Everything else belongs to the developer. The thinner the layer, the more useful it turns out to be.