Hamed Taheri

I turn AI strategy into production-grade systems that organizations can trust, govern, and scale.

Co-founder & CTO at Personize.ai. I write about memory, governance, and the operating model of enterprise AI. Based in Vancouver.

Featured

65% Off Memorization: How Batch APIs and Prompt Caching Change the Cost Per Memory

Provider batch APIs cut LLM extraction cost in half. Prompt caching cuts another fifteen to twenty percent off. Combined, customers who can wait 1-24 hours pay roughly a third of what they would on the sync path. Here is the architecture that makes that discount actually reach the customer, and why writes are the right place to spend the latency budget you saved at retrieval.
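
The headline figures combine differently depending on what the "fifteen to twenty percent" is a percentage of. Below is a back-of-envelope sketch in Python. It assumes the batch API halves the sync-path price, and the function names are hypothetical. If the caching saving is read as points of the original sync price, the customer pays 30 to 35 percent of the sync cost, which matches "roughly a third". If it is instead compounded on the already-halved batch price, the customer pays 40 to 42.5 percent.

```python
# Hypothetical back-of-envelope model of the two discounts described above.
# Assumption: the batch API halves the sync-path price per memory.

SYNC_PRICE = 1.0  # normalized sync-path cost per memory


def additive_fraction(batch_discount: float, cache_points: float) -> float:
    """Fraction of the sync price paid when both savings are points of the sync price."""
    return SYNC_PRICE - batch_discount - cache_points


def compounded_fraction(batch_discount: float, cache_rate: float) -> float:
    """Fraction paid when the caching saving applies to the already-discounted batch price."""
    return (SYNC_PRICE - batch_discount) * (1.0 - cache_rate)


for cache in (0.15, 0.20):
    add = additive_fraction(0.50, cache)
    comp = compounded_fraction(0.50, cache)
    print(f"cache={cache:.2f}: additive pays {add:.3f}, compounded pays {comp:.3f}")
```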

Agent Dispatch is Live: One Script, Hundreds of Subagents, All Governed

A code-capable agent reads your records, decides what work each one needs, and places hundreds of trigger-bound subagents in a single script run. Some fire now. Some fire when conditions are met. All governed, all observable, all on the same per-minute rate. Here is what we shipped, the use cases we are exploring with early users, and what you already get.

Documents Are Memories Too: The Claude-Markdown-File Pattern in a Governed System

AI agents need a place to write sectioned markdown notes they can later read, append to, or rewrite: the same shape Claude uses for its memory files, but inside the governance and retrieval layer that everything else uses. When to save as a document, how the section-update pattern works, and what flipping the aiExtraction switch does to your modality picture.

Instructions, Not Prompts: How We Run 1,000 Agents Per Minute Without the Bill Going Sideways

Per-token pricing punishes you for using AI at scale. Per-minute pricing forces you to design agents like production systems. Here is the pattern we converged on, the trade-offs of one big chain versus a forest of small agents, and what changes when the models themselves are designed for workflows.

One Call, Four Sources: How Atomic Memories and Typed Properties Make Unified Retrieval Possible

A real fact about a customer lives in four places at once: a typed property, an atomic memory, a document chunk, and a graph edge. Most retrieval stacks pick one and let the agent stitch the rest. Here is the storage discipline that makes returning all four in one call actually useful.

Retrieval Is a Conversation, Not a Query: Why Autonomous Agents Need Pagination as a First-Class API

Stateless retrieval is fine for chat. For autonomous agents taking 100 steps with no human in the loop, it is the single biggest source of wasted output tokens, context bloat, and silent quality collapse. Here is the pattern that replaces every-call-is-fresh with retrieval-as-conversation.

Schema-Guided Extraction: How Your Properties Teach the LLM What to Look For

Every property in your schema carries a type, a description, extraction instructions, examples, and measurement criteria. Together they form a per-property system prompt the LLM consumes at extraction time. You stop hand-prompting; the schema does it. Here is what an extraction-ready property actually looks like, and why this is the only way schema-based memory scales past a handful of fields.
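
A property that carries those five fields (type, description, extraction instructions, examples, measurement criteria) can be rendered into a per-property prompt mechanically. The sketch below is a minimal Python illustration. The `PropertySpec` class, its field names, the example `renewal_date` property, and the prompt layout are all hypothetical and are not Personize's actual schema format.

```python
from dataclasses import dataclass, field


@dataclass
class PropertySpec:
    """Hypothetical shape of an extraction-ready property (field names are illustrative)."""
    name: str
    value_type: str
    description: str
    extraction_instructions: str
    examples: list[str] = field(default_factory=list)
    measurement_criteria: str = ""


def property_prompt(p: PropertySpec) -> str:
    """Render one property into the text block the LLM sees at extraction time."""
    lines = [
        f"Property: {p.name} ({p.value_type})",
        f"Description: {p.description}",
        f"How to extract: {p.extraction_instructions}",
    ]
    # Optional fields are included only when the schema author supplied them.
    if p.examples:
        lines.append("Examples: " + "; ".join(p.examples))
    if p.measurement_criteria:
        lines.append(f"Measurement criteria: {p.measurement_criteria}")
    return "\n".join(lines)


def schema_prompt(props: list[PropertySpec]) -> str:
    """Concatenate the per-property blocks into one extraction prompt."""
    return "\n\n".join(property_prompt(p) for p in props)


# Hypothetical example property.
renewal = PropertySpec(
    name="renewal_date",
    value_type="date",
    description="Date the customer's current contract renews.",
    extraction_instructions="Only extract explicit dates tied to a renewal; ignore trial end dates.",
    examples=["'We renew on March 1, 2025' -> 2025-03-01"],
    measurement_criteria="ISO 8601 date; leave empty if not stated.",
)
print(schema_prompt([renewal]))
```

The point of the design is that adding a field to the schema adds its extraction guidance automatically, with no separate prompt to edit.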

Think and Execute For Me: Intent-Based Retrieval for Expensive Agents

Every output token your agent produces is at the premium rate of the strongest model in the loop. When the agent's job is to compress raw retrieved items into a paragraph, you are paying premium rates for compression. Here is how intent-based retrieval moves the synthesis to where the data lives, and what that does to the token bill.

The Three Channels of Graph Writes: Declared, Property-Inferred, LLM-Inferred

A relationship between two records can be declared to the system, inferred deterministically from extracted properties, or surfaced by an LLM reading free-text memories. Three independent channels, three different cost profiles, opt-in per write. The same three apply whether the write target is a record or a document. The graph is not a separate system; it is an opt-in fan-out on the same save call.

Three Modalities, One Save Surface: Atomic Memories, Typed Properties, and Markdown Documents

AI agents need three different kinds of memory: extracted atomic facts, typed structured properties, and sectioned markdown documents they can revise like notes. The discipline is picking the right modality for the kind of knowledge. The unified save surface is what makes all three composable at retrieval time, which is the only place that matters.

57x Compression for AI Coding Agents

1.5M tokens of source code, compressed to 27K of curated context. Benchmarked against Graphify and raw grep on 20 real engineering tasks; the curated approach won on total workflow cost.

Moving Governance and Evaluation Below the Application Layer

Governance lives in system prompts. Evaluation lives in separate pipelines. State lives in session stores. We moved all three into the infrastructure layer of an AI API. Here is the architecture and what it changes.

We Replaced messages[] With steps[] in Our Agent API. Here's Why.

We started with the same messages[] pattern everyone uses. For complex, repeatable agent workflows it kept failing in predictable ways. So we decomposed instructions into sequential steps with scoped tools and shared context. Here's what we learned.

Code-Orchestrated Agents vs. Tool-Calling: The Architecture Decision That Matters Most

Stripe, Shopify, and Salesforce all converged on the same pattern: LLM decides, code executes. Here's the architectural reasoning, the trade-offs, and when tool-calling actually makes sense.

The Multi-Entity Memory Pattern

Most AI systems memorize contacts. The ones that work memorize contacts, their companies, their deals, and the relationships between all of them — then recall across entity boundaries at inference time.

Encoding Solution Architecture Into an AI Skill

The early stages of AI implementation are mostly discovery — assembling scattered context into a coherent system design. We spent two years encoding that process. Here's what we found.