Dual memory, governance routing, reflection-bounded retrieval, and schema lifecycle — the architecture we built when RAG wasn't enough.


TL;DR

  • Governed Memory is a four-layer architecture: dual memory store, governance routing, governed retrieval, and schema lifecycle.
  • Each layer addresses a distinct problem that RAG leaves unaddressed: what gets stored, what rules apply, how retrieval stays complete, and how quality is maintained over time.
  • The dual memory model stores both open-set facts and schema-enforced typed properties in a single extraction pass — neither modality alone captures the full picture.
  • Governance routing delivers organizational context with 92% precision, using a fast path (~850ms) that avoids LLM calls entirely.
  • The architecture achieves 74.8% on the LoCoMo benchmark, confirming that governance and schema enforcement impose no retrieval quality penalty.

I've written before about the problems: agents that don't share what they know, governance that fragments across teams, and the gap between retrieval, memory, and governance as infrastructure concerns.

This post is about what we actually built. The architecture behind Governed Memory at Personize — how the pieces fit together, what each layer does, and why they're separated the way they are.

Four layers — Dual Memory Store, Governance Routing, Governed Retrieval, and Schema Lifecycle — each independently configurable, compounding when combined

Why Four Layers

The temptation is to build one system that does everything. A vector store with some governance rules mixed in. A memory layer that also handles policy routing. We tried versions of that. They all collapsed under the same pressure: different concerns have different update frequencies, different failure modes, and different scaling characteristics.

Memory changes every time an agent processes content. Governance changes when legal updates a policy. Retrieval quality depends on what's in the store. Schema quality degrades silently over months. Coupling these means a change in one breaks the others in ways that are hard to diagnose.

So we separated them. Four layers, each addressing a distinct governance concern, each independently configurable.

Layer 1: Dual Memory Store

Dual extraction pipeline — single LLM call producing open-set facts and schema-enforced properties, with 38% open-set only, 34% overlap, and 12% schema only

This is where content enters the system. When any agent processes content about an entity — a call transcript, an email, a document, a chat log — the dual extraction pipeline produces two parallel outputs in a single LLM call:

Open-set memories are atomic, self-contained facts. "The CTO mentioned they are evaluating three vendors." "The team recently migrated from AWS to Google Cloud." These are extracted with five invariants: completeness, self-containment, coreference resolution, temporal anchoring, and atomicity. Before storage, each candidate is compared against existing entries — candidates exceeding a cosine similarity threshold are skipped, preventing duplicate accumulation.
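The deduplication step can be sketched in a few lines. This is a minimal illustration, not the production code: the threshold value and the in-memory list standing in for the vector store are both assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def dedupe(candidates, store, threshold=0.9):
    """Skip any candidate whose embedding is a near-duplicate of an
    existing entry; accepted candidates are added to the store.
    The 0.9 threshold is illustrative, not the tuned value."""
    accepted = []
    for text, emb in candidates:
        if any(cosine(emb, existing_emb) >= threshold for _, existing_emb in store):
            continue  # near-duplicate of something already stored
        store.append((text, emb))
        accepted.append(text)
    return accepted
```

The check runs against the store as it grows, so two near-identical candidates in the same batch also collapse to one entry.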

Schema-enforced memories are typed property values extracted according to an organizational schema. If you've defined properties like job_title, budget_range, buying_stage, and pain_points, the system extracts and validates typed values with confidence scores. Before extraction, embedding similarity selects only relevant properties from potentially hundreds in the schema, reducing hallucination.
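The property pre-selection step might look like the following sketch. The `top_k` and `min_sim` cutoffs and the schema record shape are assumptions for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def select_properties(content_emb, schema, top_k=2, min_sim=0.2):
    """Rank schema properties by similarity to the content embedding and
    keep only the most relevant ones for the extraction prompt, so the
    LLM never sees hundreds of irrelevant property definitions."""
    scored = [(cosine(content_emb, p["embedding"]), p["name"]) for p in schema]
    scored.sort(reverse=True)
    return [name for sim, name in scored[:top_k] if sim >= min_sim]
```

Only the surviving properties are handed to the extraction call, which is what keeps hallucinated values down.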

Why both? Because neither alone captures the full picture. In our experiments on 250 samples spanning five content types, 38% of valuable information exists only as open-set observations that no schema anticipated — relational facts, qualitative observations, contextual details that would be permanently lost in a schema-only system. Another 12% is captured only through schema enforcement, where type validation ensures structured, queryable values that open-set extraction can't guarantee.

The dual model accepts overlap deliberately. About 34% of information is captured by both modalities. That's fine. The alternative — losing information to modality mismatch — is worse.

Three lightweight quality gates run on every extraction batch: coreference scoring (detecting unresolved pronouns), self-containment scoring (checking syntactic completeness), and temporal anchoring scoring (flagging relative time references). These filter noise before it enters the store and degrades retrieval for every downstream workflow.
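As a rough sketch of what "lightweight" means here, the gates can be approximated with simple heuristics. The regexes and the word-count proxy below are toy stand-ins, not the scoring models the system actually uses.

```python
import re

# Toy heuristics: real scorers would be more sophisticated.
UNRESOLVED_PRONOUNS = re.compile(r"\b(he|she|they|it|them|this|that)\b", re.I)
RELATIVE_TIME = re.compile(r"\b(yesterday|last week|recently|next month|tomorrow)\b", re.I)

def quality_gates(fact: str) -> dict:
    """Flag likely coreference, self-containment, and temporal-anchoring
    problems before a fact enters the store."""
    return {
        "coreference_ok": not UNRESOLVED_PRONOUNS.search(fact),
        "self_contained_ok": len(fact.split()) >= 5,  # crude completeness proxy
        "temporally_anchored_ok": not RELATIVE_TIME.search(fact),
    }
```

A fact like "They migrated recently" fails all three gates, while a resolved, anchored version passes.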

Layer 2: Governance Routing

Governance routing — three parallel enrichment steps feeding into fast mode (~850ms, no LLM) and full mode (~2-5s, two-stage LLM pipeline)

This is the layer that took us the longest to get right. The problem: when an agent needs to operate within organizational rules — pricing policies, brand voice, compliance requirements — how do you get the right rules to the right agent for the right task?

We tried RAG for this. It failed. "Write a cold outbound email" retrieved chunks about email server configuration. A pricing policy competed with a product FAQ that mentioned "pricing." Semantic similarity doesn't understand authority.

Governance routing works on different principles. Organizational context is stored as governance variables — structured metadata including name, description, tags, content, and content-aware embeddings. When a variable is created, three enrichment steps run in parallel:

  1. Hypothetical Prompt Enrichment (HyPE) — generating synthetic queries representing plausible agent requests. If you have a "Brand Voice Guidelines" variable, HyPE generates queries like "how should I write a cold email?" and "what tone should support responses use?" These synthetic queries are embedded alongside the variable, so routing matches on intent, not just keyword overlap.

  2. Governance scope inference — an LLM determines whether the variable is always-on (like a compliance policy that applies to every interaction) and infers trigger keywords.

  3. Content-aware embedding computed from metadata and a content preview.
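Putting the three enrichment outputs together, an enriched governance variable might carry a shape like the one below. The field names are illustrative assumptions, not the production schema.

```python
from dataclasses import dataclass, field

@dataclass
class GovernanceVariable:
    """Illustrative record for an enriched governance variable."""
    name: str
    description: str
    tags: list
    content: str
    hype_queries: list = field(default_factory=list)   # from HyPE enrichment
    always_on: bool = False                            # inferred governance scope
    trigger_keywords: list = field(default_factory=list)
    embedding: list = field(default_factory=list)      # content-aware embedding

brand_voice = GovernanceVariable(
    name="Brand Voice Guidelines",
    description="Tone and style rules for outbound communication",
    tags=["brand", "writing"],
    content="Use a warm, concise tone...",
    hype_queries=["how should I write a cold email?",
                  "what tone should support responses use?"],
    trigger_keywords=["email", "tone", "outreach"],
)
```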

The routing itself has two modes:

Fast mode (~850ms average) makes no LLM calls. Each candidate is scored using a weighted composite of embedding similarity and keyword overlap, with a governance scope boost for always-on variables. This is the default for most production tasks.

Full mode (~2–5s) adds a two-stage LLM pipeline: embedding pre-filter reduces candidates, then structured analysis classifies context as critical or supplementary with section-level extraction. This mode is for high-stakes decisions where precision matters more than latency.
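The fast-mode composite can be sketched as a single scoring function. The weights and boost value are assumptions for illustration; the production values are tuned.

```python
def fast_mode_score(embedding_sim, keyword_overlap, always_on,
                    w_embed=0.6, w_kw=0.4, scope_boost=0.15):
    """Rank a candidate governance variable with no LLM call: a weighted
    blend of embedding similarity and keyword overlap, plus a boost for
    always-on variables. Weights here are illustrative, not tuned."""
    score = w_embed * embedding_sim + w_kw * keyword_overlap
    if always_on:
        score += scope_boost
    return min(score, 1.0)
```

Because this is pure arithmetic over precomputed embeddings and keywords, the whole routing pass stays under a second.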

In a testbed of 25 governance variables spanning 5 categories, routing achieves 92% precision and 88% recall on 20 diverse task types. The result that surprised us most: well-authored governance variables are 20–50 percentage points more discoverable than poorly-authored equivalents. In 3 of 5 categories, poorly-authored variables scored 0% discovery rate. This validated the AI-assisted authoring tools as operationally significant, not just nice-to-have.

Progressive Context Delivery

Modern agents operate in autonomous loops — planning, acting, observing, re-planning — without human intervention between steps. Each step may invoke governance routing. Without session awareness, the same compliance policy gets re-injected into every step, consuming context window capacity that should be spent on reasoning.

Progressive delivery maintains a session state that tracks which variables have already been delivered. On each routing call, already-delivered variables are excluded — only new or newly relevant content is injected. This reduces token usage by 50% across multi-step workflows, and it's not just a cost optimization. Redundant context competes with fresh instructions for the model's attention. Removing it improves accuracy.
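The session-state mechanics are simple enough to sketch directly. The class and method names are hypothetical.

```python
class DeliverySession:
    """Track which governance variables have already been injected in this
    agent session, so each routing call delivers only new material."""
    def __init__(self):
        self.delivered = set()

    def filter_new(self, routed_ids):
        """Return only the variables not yet delivered, and record them."""
        fresh = [v for v in routed_ids if v not in self.delivered]
        self.delivered.update(fresh)
        return fresh
```

Across a multi-step loop, repeated hits on the same always-on policy collapse to a single injection.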

Layer 3: Governed Retrieval

Governed retrieval pipeline — query embedding through vector search, post-filtering, reflection-bounded retrieval with +25.7pp completeness improvement, merge, and LLM synthesis

The retrieval path: query embedding → vector search within the organization partition with entity-scoped CRM key filters → post-filtering by metadata → optional reflection loop → merge and deduplication → optional LLM answer synthesis with source attribution.

Two mechanisms make this different from standard RAG:

Entity-scoped isolation. All operations are partitioned by organization ID. Within an organization, retrieval is scoped to specific entities using CRM keys. Under adversarial conditions — 100 entities with the same industry, similar roles, overlapping names, and similar deal sizes — entity-scoped retrieval produces zero cross-entity leakage across 3,800 results. The isolation is enforced by CRM key pre-filtering, not embedding distinctiveness. This is an important distinction: embeddings for entities in the same industry with similar roles will be similar. Relying on embedding distance for isolation would fail.
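The ordering is the whole point: filter first, rank second. A minimal sketch, assuming a flat in-memory index in place of the real vector store:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(index, org_id, crm_keys, query_emb, top_k=5):
    """Pre-filter by org partition and CRM entity keys BEFORE any
    similarity ranking, so isolation never depends on embedding
    distance (illustrative record shape)."""
    candidates = [r for r in index
                  if r["org_id"] == org_id and r["crm_key"] in crm_keys]
    candidates.sort(key=lambda r: cosine(query_emb, r["embedding"]), reverse=True)
    return candidates[:top_k]
```

A record from the wrong org or the wrong entity never reaches the ranking step, no matter how similar its embedding is.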

Reflection-bounded retrieval. For complex queries where information is scattered across multiple sources, a reflection loop checks evidence completeness and generates targeted follow-up queries within bounded rounds (default: 2). This improves completeness from 37.1% to 62.8% on hard multi-hop queries — a 25.7 percentage point improvement. The key finding: query generation strategy matters more than round count. Manual multi-hop retrieval (62.8%) dramatically outperforms API-managed reflection (40.4%), and a single round (61.2%) gets most of the way to the two-round ceiling (62.8%).
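The loop structure can be sketched generically; the three callables stand in for the real search, completeness-check, and follow-up-query components, which are assumptions here.

```python
def reflective_retrieve(search, is_complete, followups, query, max_rounds=2):
    """Bounded reflection loop: retrieve, check evidence completeness,
    issue targeted follow-up queries, and stop at the round limit.
    The default of 2 rounds matches the bound described above."""
    evidence = search(query)
    for _ in range(max_rounds):
        if is_complete(evidence):
            break
        for q in followups(query, evidence):
            evidence.extend(search(q))
    return evidence
```

The bound matters: each round is another retrieval pass, so the loop trades latency for completeness only up to a fixed ceiling.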

Hybrid retrieval operates across both memory types simultaneously. A standalone entity context injection endpoint compiles per-entity data from both tiers into a token-budgeted context block, prioritizing schema-enforced properties (more structured, actionable) over open-set memories (ordered by recency).
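The token-budgeted prioritization can be sketched as follows. Counting whitespace-split words as tokens is a deliberate simplification; a real implementation would use the model's tokenizer.

```python
def compile_context(schema_props, open_set, budget_tokens=200):
    """Fill a token budget with schema-enforced properties first, then
    open-set memories ordered by recency. Word count stands in for a
    real tokenizer here."""
    lines, used = [], 0
    prioritized = [f"{k}: {v}" for k, v in schema_props.items()]
    recent_first = sorted(open_set, key=lambda m: m["ts"], reverse=True)
    prioritized += [m["text"] for m in recent_first]
    for line in prioritized:
        cost = len(line.split())
        if used + cost > budget_tokens:
            break
        lines.append(line)
        used += cost
    return "\n".join(lines)
```

When the budget runs out, it is always the oldest open-set memories that fall off, never the structured properties.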

Layer 4: Schema Lifecycle and Quality Feedback

Schemas age. Models get updated. Content types shift. New agent workflows produce data the schema was not designed for. This layer closes the feedback loop.

AI-assisted authoring. Users describe what they need in natural language — "I want to capture each contact's role, buying intent, and preferred communication channel" — and an AI assistant generates typed property specifications with extraction instructions and validation constraints.

Interactive enhancement. Once a schema is in use, operators describe observed issues — "the role field is too vague" or "buying intent should distinguish between active evaluation and future interest" — and receive revised property definitions.

Criteria-based rubric scoring. Agent interactions are scored against domain-specific rubrics (sales, support, research, or custom). Evaluation captures structured traces: conversation summary, tool usage, memory recall log with usage flags, memory creation log, and governance context with helpfulness ratings. This answers diagnostic questions that scalar scores can't: "Did the agent score low on Completeness because it failed to recall relevant memories, or because it recalled them but didn't use them?"

Automated schema refinement. A three-phase pipeline: extraction replay producing baseline results, per-property analysis classifying each property as extracted/missed/low-confidence/inaccurate, and parallel per-property optimization producing revised definitions. The three-phase design separates objective data, diagnostic judgment, and targeted fixes.

How It Performs

We validated each mechanism through controlled experiments (N=250, five content types):

| Mechanism | Result |
|---|---|
| Fact recall (dual extraction) | 99.6% |
| Governance routing precision | 92% |
| Token reduction (progressive delivery) | 50% |
| Retrieval completeness improvement (reflection) | +25.7pp |
| Cross-entity leakage | Zero (3,800 results) |
| Adversarial governance compliance | 100% (50 scenarios) |
| LoCoMo benchmark (overall) | 74.8% |

The LoCoMo result matters most for the architectural argument. 74.8% overall accuracy confirms that governance, schema enforcement, and entity isolation impose no retrieval quality penalty. The architecture achieves state-of-the-art memory accuracy despite doing substantially more than standalone memory systems. On open-ended inference — the largest category at 841 questions — the system exceeds human-level performance (83.6% vs. 75.4%).

The Architectural Principle

Each layer can be independently configured or disabled. You can run dual extraction without governance routing. You can use governance routing without reflection. You can skip schema lifecycle management entirely and still get the other three layers.

But they compound. Memory without governance means agents share knowledge but not rules. Governance without memory means rules reach agents that have no context about the entities they're acting on. Retrieval without quality feedback means silent degradation over months.

The full system — memory, governance, retrieval, and lifecycle — consistently outperforms any subset by 6–7 points on end-to-end quality evaluation. Not because each layer adds dramatic gains individually, but because the gaps between layers are where organizational trust in AI erodes.

The system is commercially deployed at Personize. The paper formalizing this architecture is forthcoming.


Frequently Asked Questions

How does this relate to RAG? RAG is a retrieval primitive — it addresses how to find relevant information. Governed Memory operates at the infrastructure layer above, extending retrieval with schema enforcement, organizational governance routing, entity-scoped isolation, and quality feedback. The relationship is architectural, not competitive; the governed layer requires capable retrieval underneath.

Can I use this with my existing agent framework? The system exposes a standard MCP interface and SDK. Any compatible agent — regardless of framework or vendor — can read, write, and govern memory through the same organizational context without bespoke integration.

What about multi-tenant security? All operations are partitioned by organization ID at the storage layer. A two-phase content redaction pipeline scrubs PII and secrets before and after LLM extraction. Entity detection covers four tiers: secrets (API keys, private keys), financial PII (credit cards, IBAN), identity PII (SSNs), and contact PII (emails, phones, IPs).
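A toy version of the tiered scrubbing pass, with deliberately simplified patterns (real detectors for each tier would be far more thorough; every regex below is an assumption):

```python
import re

# Simplified stand-ins for the four detection tiers.
REDACTION_TIERS = {
    "secret":    re.compile(r"\b(sk|AKIA)[A-Za-z0-9_-]{8,}\b"),
    "financial": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
    "identity":  re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "contact":   re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text):
    """Replace each match with a tier-labeled placeholder. The real
    pipeline runs both before and after LLM extraction."""
    for tier, pattern in REDACTION_TIERS.items():
        text = pattern.sub(f"[REDACTED:{tier}]", text)
    return text
```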

How do I get started? Start with Layer 1 (dual memory) for a single workflow. Add Layer 2 (governance routing) when you have organizational policies that need to reach multiple agents. Add Layer 3 (reflection) when retrieval completeness matters. Add Layer 4 (schema lifecycle) when you're running at scale and need to monitor extraction quality over time.