38% of valuable information that no schema anticipated. 12% that is usable only with type enforcement. Neither modality alone captures the full picture, and both come from one extraction pass.


TL;DR

  • Open-set memory extracts atomic facts regardless of schema: "The CTO mentioned they're evaluating three vendors." Flexible, comprehensive, but unstructured.
  • Schema-enforced memory extracts typed property values: buying_stage: "Evaluating", competitor_count: 3. Structured, queryable, but limited to what the schema anticipates.
  • Across 20 samples: 38% of valuable information is captured only by open-set, 12% only by schema-enforced, 34% by both. Combined recall reaches 82.8%.
  • Both outputs come from a single LLM call. The same content is processed once.
  • Five quality gates ensure extracted facts are coreference-resolved, self-contained, and temporally anchored before they enter the store.

Most of the memory literature, and most production systems, focus on free-form memories: atomic facts extracted from conversations, stored as text, retrieved by similarity. That works well for chat assistants and simple Q&A. But when you have workflows with complex business logic, when you need to use data across platforms, when downstream systems need typed values they can filter, aggregate, and act on, free-form isn't enough.

I started by introducing AI-extracted properties and custom properties to databases and CRMs. It was hard to scale. Each property had its own nuanced way of measuring and reporting. So I designed schemas built specifically for AI use, where each property carries a type, examples, extraction hints, and measurement criteria. We went further and used AI to refine these schemas for itself, which boosted extraction accuracy significantly.

But schema-enforced memory doesn't replace free-form. It complements it. The schema captures what you can anticipate; open-set extraction captures what you can't. We run both from the same content in the same extraction pass, and this dual memory architecture is now live in Personize.ai. The experiment below shows exactly where each modality contributes, what the overlap looks like, and why the combination reaches a recall ceiling that neither achieves alone.

What Each Modality Does

Open-set memory extracts atomic, self-contained facts from unstructured content. No schema required. The extraction prompt enforces five invariants per fact:

  • Completeness: the fact contains all information needed to understand it
  • Self-containment: the fact can be understood without reference to surrounding context
  • Coreference resolution: pronouns are replaced with named entities ("he" → "the CTO, James Chen")
  • Temporal anchoring: relative time references are resolved ("recently" → "in October 2025")
  • Atomicity: each fact states one thing

From a sales call transcript, open-set extraction produces facts like:

"The CTO, James Chen, mentioned in the October 2025 call that they are currently evaluating three cloud migration vendors, including Personize."

"The VP of Engineering expressed frustration with their current provider's API reliability during peak load periods."

"The company recently hired two platform engineers, signaling active infrastructure investment."

These facts are retrievable by any agent, across any workflow, without knowing in advance what questions will be asked. The third fact (the hiring signal) is exactly the kind of insight that competitive intelligence depends on and that no schema anticipates.

Schema-enforced memory is what I built when free-form wasn't enough. Early on, I was adding AI-extracted properties directly to CRM fields, one property at a time, each with its own extraction logic, its own edge cases, its own way of measuring accuracy. It didn't scale. What scaled was designing schemas specifically for AI consumption: each property carries a type, examples, extraction hints, and measurement criteria. The AI doesn't just extract values; it uses the schema's own definitions to decide how to extract them. From a sales call transcript:

buying_stage:      "Evaluating"         (type: options, confidence: 0.88)
competitor_count:  3                    (type: number, confidence: 0.90)
pain_points:       ["API reliability"]  (type: array, confidence: 0.85)
budget_range:      450000               (type: number, confidence: 0.92)

Dual extraction flow: one LLM call producing open-set facts and typed schema properties

These values are typed, validated, and queryable. They feed directly into the downstream systems that free-form memories can't serve: CRM fields that need numbers, workflow triggers that need enums, pipeline reports that aggregate across 10,000 contacts. The hiring signal from fact #3 above won't appear here; the schema didn't anticipate it. But the budget and buying stage will be precise, structured, and machine-readable. That's the point: each modality covers what the other misses.
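The kind of property metadata described here can be sketched as a small data structure. The `SchemaProperty` class and its field names are illustrative, not Personize.ai's actual schema format:

```python
from dataclasses import dataclass, field

@dataclass
class SchemaProperty:
    """A property authored for AI extraction: the metadata guides the model."""
    name: str
    type: str                       # "options" | "number" | "array" | "text"
    description: str
    examples: list = field(default_factory=list)
    extraction_hints: str = ""      # where and how the value tends to appear
    options: list = field(default_factory=list)  # valid values for "options" type

buying_stage = SchemaProperty(
    name="buying_stage",
    type="options",
    description="Where the account sits in the purchase process.",
    examples=["Evaluating", "Negotiating"],
    extraction_hints="Infer from vendor comparisons, demo requests, contract talk.",
    options=["Unaware", "Researching", "Evaluating", "Negotiating", "Closed"],
)
```

Because the description, examples, and hints live alongside the type, the same metadata can serve extraction guidance and relevance filtering without separate configuration.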

The Coverage Numbers

The intuition that both modalities are needed isn't enough. We wanted to measure exactly where each one contributes and where they overlap. We ran a controlled experiment across 20 samples, annotating every piece of valuable information in the source content and tracking which modality captured it:

| Category | % of Total |
|---|---|
| Captured by both modalities | 34% |
| Open-set only (long-tail insights) | 38% |
| Schema-enforced only (typed values) | 12% |
| Missed by both | 16% |
| Combined recall | 82.8% |

Dual memory coverage breakdown: where valuable information lives

38% of valuable information exists only as open-set observations that no schema anticipated: relational facts, qualitative observations, hiring signals, competitive mentions, contextual details that are specific to the moment and the entity. In a schema-only system, these are permanently lost. They're extracted once, used once if the right agent happens to ask the right question, and then gone.

12% is captured only by schema enforcement. These are cases where the open-set extractor produces natural language that's imprecise ("a budget in the mid-six-figures" instead of "$450,000") or where type validation is essential for downstream consumption. This is exactly the gap I saw when building CRM integrations: a CRM field for deal value needs a number, not a prose approximation. The schema's type definitions and extraction hints solve this at write time.

The 34% captured by both modalities represents intentional redundancy. The dual model accepts this overlap deliberately. The alternative (trying to allocate each fact to exactly one modality) would require a routing decision during extraction that introduces its own errors. Overlap is cheaper than loss.

16% is missed by both modalities. Some information in unstructured content resists extraction: deeply implicit inferences, information that's only meaningful in context that spans multiple sources, ambiguous statements that require domain knowledge to interpret. We report this honestly. The 82.8% combined recall represents a real ceiling for single-pass extraction. Knowing where the ceiling is matters as much as raising it.

One Extraction Pass

The dual model runs in a single LLM call. The extraction prompt receives both the content and the selected schema properties, and produces two parallel outputs simultaneously:

  1. Typed property values with confidence scores and update semantics
  2. Open-set atomic facts

This ensures the same content is processed once. The alternative (running two separate extraction passes) doubles the cost and introduces inconsistency (the two passes may interpret the same content differently).
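A minimal sketch of what the single call's parsed output might look like. The JSON shape and field names are assumptions for illustration, not the production format:

```python
import json

# One hypothetical LLM response carrying both modalities at once,
# so the content is interpreted exactly once.
raw = """
{
  "properties": [
    {"name": "buying_stage", "value": "Evaluating",
     "confidence": 0.88, "update": "replace"},
    {"name": "pain_points", "value": ["API reliability"],
     "confidence": 0.85, "update": "append"}
  ],
  "facts": [
    "The CTO, James Chen, mentioned in the October 2025 call that they are currently evaluating three cloud migration vendors, including Personize."
  ]
}
"""
response = json.loads(raw)

# Split the single response into the two parallel stores.
typed_values = {p["name"]: p["value"] for p in response["properties"]}
open_set_facts = response["facts"]
```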

Before extraction, property selection reduces the schema to only relevant properties. Embedding similarity between the content and property metadata selects a subset (minimum score threshold, maximum count cap). A sales call transcript about cloud migration doesn't need properties designed for customer support tickets. The LLM sees only what's relevant, reducing hallucination and improving type compliance.
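The selection step can be sketched as a similarity filter over plain vectors. The threshold, cap, and toy 3-d embeddings are illustrative; production would use real embedding vectors:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def select_properties(content_vec, prop_vecs, min_score=0.3, max_count=10):
    """Keep only properties whose metadata embedding is similar enough
    to the content, capped at max_count, highest similarity first."""
    scored = [(name, cosine(content_vec, vec)) for name, vec in prop_vecs.items()]
    scored = [(n, s) for n, s in scored if s >= min_score]
    scored.sort(key=lambda t: t[1], reverse=True)
    return [n for n, _ in scored[:max_count]]

# Toy vectors: the support-ticket property points away from the content.
content = [1.0, 0.2, 0.0]
props = {
    "buying_stage": [0.9, 0.3, 0.1],
    "ticket_priority": [0.0, 0.1, 1.0],
}
selected = select_properties(content, props)
```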

This is one of the advantages of designing schemas for AI use from the start. Each property's metadata (description, examples, extraction hints) serves double duty: it guides the AI during extraction, and it enables smart property selection so the model isn't overwhelmed by irrelevant fields. When we added AI-driven schema refinement on top of this, letting the model suggest improvements to property definitions based on extraction performance, accuracy improved significantly. The schema becomes a living document that the AI helps maintain.

Quality Gates

Both modalities benefit from quality gates that run on every extraction batch before anything enters the store.

Coreference scoring detects unresolved pronouns: facts containing "he," "they," "it" without clear resolution. A fact like "He mentioned they're evaluating options" is ambiguous. It needs to be resolved before storage or discarded. The gate scores each fact and flags those below threshold.

Self-containment scoring checks syntactic completeness, flagging facts that reference context they don't contain. "The migration timeline they discussed" is incomplete without knowing who "they" are or which migration. The gate catches these before they create retrieval noise.

Temporal anchoring scoring flags relative time references ("recently," "last quarter," "next month") that will become ambiguous over time. A fact stored in October 2025 that says "they recently migrated" is meaningless by April 2026 without knowing the anchor date. The gate flags these for resolution.

These are lightweight heuristic checks, not deep semantic analysis. They run in milliseconds per batch. But their effect compounds: a clean store produces better retrieval results for every query, every workflow, forever. This kind of systematic write-time investment is what makes the difference between a demo and a production system. Write-time quality gates are a one-time cost. Retrieval-time noise is an indefinite tax.
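Two of the three gates can be sketched as regex heuristics. The actual gates score facts against a threshold rather than passing or failing them outright, and these patterns are illustrative:

```python
import re

# Pronouns that suggest an unresolved coreference.
UNRESOLVED_PRONOUNS = re.compile(
    r"\b(he|she|they|it|his|her|their|its)\b", re.IGNORECASE)
# Relative time references that will drift without an anchor date.
RELATIVE_TIME = re.compile(
    r"\b(recently|soon|(last|next)\s+(week|month|quarter|year))\b", re.IGNORECASE)

def gate_fact(fact: str) -> list:
    """Return the names of gates a fact fails; an empty list means it may be stored."""
    failures = []
    if UNRESOLVED_PRONOUNS.search(fact):
        failures.append("coreference")
    if RELATIVE_TIME.search(fact):
        failures.append("temporal_anchoring")
    return failures
```

A fully anchored, coreference-resolved fact passes both checks; facts like "He mentioned they're evaluating options" or "they recently migrated" are flagged for resolution or discard.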

Deduplication at Write Time

When multiple workflows process overlapping content about the same entity (the enrichment agent processes a LinkedIn update, the support agent processes an email, the outbound agent processes a call recording), the same facts appear in multiple sources. Without deduplication, they accumulate.

Before storage, each open-set candidate is compared against existing entries using cosine similarity. Candidates exceeding a threshold (default: 0.92) are skipped as near-duplicates of something already in the store. A separate background consolidation process periodically merges near-duplicates using a higher threshold (0.95) to minimize false merges.
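The write-time check can be sketched in a few lines, with toy 2-d vectors standing in for real sentence embeddings; the 0.92 default matches the threshold above:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def admit(candidate_vec, stored_vecs, threshold=0.92):
    """Admit a candidate fact only if no stored embedding is a near-duplicate."""
    return all(cosine(candidate_vec, v) < threshold for v in stored_vecs)

# Toy store with two orthogonal entries.
store = [[1.0, 0.0], [0.0, 1.0]]
```

The background consolidation pass would run the same comparison store-against-store, with the stricter 0.95 threshold to minimize false merges.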

For schema-enforced properties, deduplication works differently: the highest-confidence value for each property wins, with explicit update flags supporting both replacement (single-value properties like budget) and accumulation (array properties like pain points).
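A sketch of those merge semantics, with values modeled as `(value, confidence)` pairs; the shape and the helper are illustrative:

```python
def merge_property(existing, incoming, is_array=False):
    """Highest-confidence value wins for single-value properties;
    array properties accumulate new items (order-preserving, deduplicated)."""
    if is_array:
        merged = list(dict.fromkeys(existing[0] + incoming[0]))
        return (merged, max(existing[1], incoming[1]))
    return incoming if incoming[1] > existing[1] else existing

# Replacement: a later, more confident budget overwrites the earlier one.
budget = merge_property(("$400,000", 0.70), ("$450,000", 0.92))
# Accumulation: pain points append rather than replace.
pains = merge_property((["API reliability"], 0.85), (["pricing"], 0.80), is_array=True)
```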

We measured this systematically: in a controlled multi-source experiment (five overlapping sources processed for the same entity), 83% of candidate memories were duplicates of something already stored. Without deduplication, those duplicates would have entered the store and degraded retrieval quality. The signal-to-noise ratio drops as duplicates accumulate, and the effect is silent until retrieval completeness visibly degrades months later. This is the kind of problem you only find by running controlled experiments against production-like data, not by reasoning about architecture in the abstract.

When to Lean on Each Modality

The dual model runs both modalities for all content. Neither replaces the other. But downstream systems benefit from leaning on each modality for different tasks:

Use open-set memory when:

  • The query requires long-tail insights not captured by the schema
  • You're building context for personalization (nuanced, relationship-specific facts)
  • You need to retrieve information about topics the schema didn't anticipate
  • The agent is doing research or synthesis across scattered observations

Use schema-enforced memory when:

  • A downstream system needs structured, queryable values
  • You're triggering conditional workflows based on entity state
  • You're syncing to a CRM, data warehouse, or analytics system
  • You need to aggregate or filter across many entities

For prompt augmentation (the most common use case), both modalities contribute. Open-set facts provide the context and nuance. Schema-enforced properties provide the precision and structure. A sales email that references the CTO's specific frustration with API reliability (open-set) and correctly acknowledges the $450K budget range (schema-enforced) is better than one that has either alone.
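Assembling that augmentation might look like this sketch; the section labels and helper name are illustrative:

```python
def build_context(facts, properties):
    """Assemble prompt context from both memory modalities: typed properties
    supply precision, open-set facts supply nuance."""
    lines = ["Known attributes:"]
    lines += [f"- {k}: {v}" for k, v in properties.items()]
    lines.append("Relevant observations:")
    lines += [f"- {f}" for f in facts]
    return "\n".join(lines)

context = build_context(
    facts=["The VP of Engineering expressed frustration with API reliability."],
    properties={"buying_stage": "Evaluating", "budget_range": "$450,000"},
)
```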


Frequently Asked Questions

Why not just use open-set and parse structure out at query time? This is what most systems try first. It's what I tried first. Parsing structure at query time requires an LLM call per query, adds latency, introduces parsing errors, and produces inconsistent results depending on how the query is framed. Schema-enforced extraction happens once at write time, with a consistent prompt and type validation. The cost is paid once; the benefit is available forever.

What if my schema evolves? Do I lose old extractions? Schema changes affect future extractions. Existing schema-enforced values remain in the store under the original property definitions. The automated schema refinement pipeline can replay extraction against existing content with revised definitions, producing updated values that can coexist with originals until a migration is complete.

How many schema properties is too many? Property selection handles this. Even with hundreds of properties in the schema, the LLM only sees the relevant subset for each extraction. The practical limit is on schema organization and maintenance quality, not on the count. This is why we designed schemas with rich metadata from the start. Well-authored properties with clear descriptions, extraction hints, and examples are more valuable than a large number of vaguely defined ones. And since AI can refine these definitions based on extraction results, the schema improves over time rather than degrading.

What about properties that almost always have null values? Null extractions are tracked with confidence scores. A property that consistently returns null across a content type is either poorly authored (the schema description doesn't match how the information appears in the source) or genuinely absent from that content type. The schema lifecycle tools identify both patterns: poorly-authored properties are candidates for refinement, absent properties are candidates for removal or scope restriction.