Schemas age. Models get updated. Content types shift. New agent workflows produce data the schema wasn't designed for. Here's how to build a schema that keeps up.


TL;DR

  • The biggest risk in schema-enforced memory isn't building the initial schema — it's maintaining it as everything around it changes.
  • Schema quality has an outsized impact on governance routing: well-authored governance variables are 20–50pp more discoverable than poorly-authored ones. Authoring is an operational concern.
  • AI-assisted authoring translates natural language descriptions into typed property specifications. Interactive enhancement refines based on observed issues.
  • Criteria-based rubric scoring captures not just whether the output was good, but whether the agent used the memories it recalled and whether the governance context influenced the output.
  • Automated refinement runs a three-phase pipeline: extraction replay, per-property classification, and parallel generation of revised definitions.

The schema feels like a one-time problem. Define your properties, set up extraction, deploy. Done.

Two months later, extraction quality has quietly degraded. A model update changed how the LLM interprets certain property descriptions. New content types appeared — customer Slack exports, voice call transcripts — that the schema wasn't designed for. An operator added 30 new properties without updating descriptions to match how the new content uses those concepts.

No alerts. No errors. Just slightly worse outputs, slowly, everywhere.

This is the silent failure mode of schema-enforced memory. And it's why we treat schemas as living documents.

Why Schemas Drift

Three forces degrade schema quality over time:

Model updates. When the underlying LLM changes — version update, provider switch, fine-tune — its interpretation of property descriptions shifts. A property description written to work with one model's tendency to be literal may under-extract with a model that interprets instructions more conservatively. These shifts are small individually and compound over months.

Content type evolution. Organizations deploy new agent workflows. A sales-focused schema meets support tickets, Slack exports, voice transcripts, product usage logs. The extraction hints that work for sales call notes don't work for Slack threads. Properties return null not because the information isn't there, but because the extraction hint doesn't match how that information appears in the new content type.

Schema accumulation. Schemas grow. Properties are added for new use cases. Old properties become stale as business focus shifts. Nobody removes them. The LLM is presented with an increasingly large set of properties, many irrelevant to the content being processed, increasing the noise in the extraction prompt and degrading precision for the properties that do matter.

Silent drift is worse than obvious breakage. Obvious breakage gets fixed. Silent drift gets shipped to downstream systems, reported as accurate data, and acted on.

The Four-Stage Lifecycle

We built a schema lifecycle with four stages designed to prevent drift and close the feedback loop.

Stage 1: AI-Assisted Authoring

The problem with schema design is that it requires two types of knowledge simultaneously: domain knowledge about what information matters, and technical knowledge about how to write property descriptions that produce reliable extraction.

Most domain experts have the first. Almost none have the second.

AI-assisted authoring bridges this. Users describe what they need in natural language: "I want to capture each contact's role, their buying intent, and whether they've had issues with competitors." The system translates this into typed property specifications with names, types, descriptions, extraction hints, and validation constraints.

This isn't just convenience. Authoring quality has a measurable impact on downstream performance. Governance variables — which follow the same authoring principles as schema properties — show a 20–50 percentage point discoverability gap between well-authored and poorly-authored versions. In 3 of 5 categories tested, poorly-authored variables scored a 0% discovery rate. The routing system couldn't find them at all.

The same principle applies to schema properties. A property titled "Status" with no description will extract inconsistently and unpredictably. A property titled "Buying Stage" with a description that includes the expected values, how they appear in conversation, and what distinguishes them from each other will extract reliably. AI-assisted authoring produces the latter.
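What a typed property specification looks like can be sketched as a small data structure. The field names and example values here are illustrative, not the system's actual API — the point is how much context a well-authored property carries compared to a bare title:

```python
from dataclasses import dataclass, field

# Hypothetical shape of an authored property spec; field names are illustrative.
@dataclass
class PropertySpec:
    name: str                   # machine name, e.g. "buying_stage"
    type: str                   # "string", "enum", "boolean", ...
    description: str            # what the property means, read by the extraction LLM
    extraction_hint: str = ""   # how the value tends to appear in the content
    allowed_values: list[str] = field(default_factory=list)  # validation constraint

# A well-authored property: expected values, how they show up, what separates them.
buying_stage = PropertySpec(
    name="buying_stage",
    type="enum",
    description=(
        "Where the prospect is in their purchase process. 'evaluating' means "
        "actively comparing options; 'committed' means a decision has been made."
    ),
    extraction_hint="Often implied by phrases like 'we're comparing vendors' or 'we signed'.",
    allowed_values=["unaware", "evaluating", "committed"],
)
```

A property authored as a bare `PropertySpec(name="status", type="string", description="")` is exactly the "Status" failure mode described above.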

Stage 2: Interactive Enhancement

Once a schema is in production, operators observe what's working and what isn't. "The role field is returning job titles when I want seniority level." "Buying intent is conflating active evaluation with future interest." "The pain points field is capturing everything as pain points including things the prospect likes."

Interactive enhancement turns these natural language observations into revised property definitions. Users describe the issue; the system generates updated descriptions, extraction hints, and validation constraints. Changes can be made at per-property or bulk granularity.

Interactive enhancement closes the feedback loop between what operators observe and what the schema specifies.

Without this, operators either learn to write extraction prompts (a specialized skill) or live with degraded extraction quality. With it, the domain expert who knows the content can directly improve the schema without understanding the mechanics of LLM extraction.
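Mechanically, the enhancement step can be sketched as: current spec plus operator observation in, revised spec out. This is a minimal sketch under assumed interfaces — `llm_revise` stands in for whatever model call generates the revision:

```python
# Hypothetical sketch of interactive enhancement. The spec is a plain dict and
# llm_revise is a stand-in for the LLM call that proposes revised fields.
def enhance_property(spec: dict, observation: str, llm_revise) -> dict:
    prompt = (
        f"Current property definition:\n{spec}\n\n"
        f"Operator-observed issue:\n{observation}\n\n"
        "Return revised fields (description, extraction hint) that fix the issue."
    )
    revision = llm_revise(prompt)   # e.g. {"description": "...", "extraction_hint": "..."}
    return {**spec, **revision}     # apply the revision on top of the existing spec
```

The merge at the end is the important design choice: the operator's observation changes only the fields the revision touches, so names, types, and untouched constraints stay stable.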

Stage 3: Criteria-Based Rubric Scoring

This is where the feedback loop gets data.

Standard evaluation asks: was the output good? Rubric scoring asks: why was the output good or bad, and which parts of the memory and governance infrastructure contributed?

Each agent interaction is scored against domain-specific rubrics. We provide four presets — default, sales, support, research — each normalized to 100 points with weighted criteria. Organizations can define custom rubrics.
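A weighted rubric is simple to make concrete. The criterion names and weights below are assumptions for illustration — only the normalize-to-100 constraint comes from the design described here:

```python
# Illustrative rubric preset; criteria and weights are assumed, not the
# actual presets. The invariant is that weights sum to 100.
SALES_RUBRIC = {
    "completeness": 30,
    "memory_usage": 25,
    "governance_adherence": 25,
    "tone": 20,
}
assert sum(SALES_RUBRIC.values()) == 100

def score(criterion_scores: dict[str, float], rubric: dict[str, int]) -> float:
    """Weighted total on a 0-100 scale; each criterion is scored 0.0-1.0."""
    return sum(rubric[c] * criterion_scores.get(c, 0.0) for c in rubric)
```

Because every rubric normalizes to 100, scores are comparable across presets and across time, which is what makes the trend analysis mentioned later possible.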

The scoring captures a structured execution trace:

  • Conversation summary — what the agent did
  • Tool usage log — which tools were called and in what order
  • Memory recall log with usage flags — which memories were retrieved AND whether they appeared in the output
  • Memory creation log — what new memories were written
  • Governance context log with helpfulness ratings — which governance variables were delivered and whether they influenced the response

The usage flags on the memory recall log answer a diagnostic question that scalar scores can't: "Did the agent score low on Completeness because it failed to recall relevant memories, or because it recalled them but didn't use them?" These are different problems requiring different fixes. The first is a retrieval problem. The second is a prompting or memory selection problem.

The trace is more valuable than the score. The score tells you something is wrong. The trace tells you where.

Stage 4: Automated Schema Refinement

When evaluation reveals extraction quality issues — a property consistently returning null, low confidence scores, inaccurate values — automated refinement addresses them systematically.

The pipeline runs three phases:

Phase 1: Extraction replay. The current schema is run against a sample of content that previously produced poor extraction. This establishes the baseline — exactly how the current schema performs on this content type.

Phase 2: Per-property analysis. Each property is classified: extracted correctly, missed, low confidence, inaccurate, or unavailable (the information genuinely isn't in the content). Properties in the missed/low-confidence/inaccurate categories get structured improvement instructions generated — specific guidance on what to change and why.

Phase 3: Parallel per-property optimization. Revised definitions are generated for each underperforming property in parallel, keeping total latency bounded regardless of schema size. Each revised definition includes change annotations explaining what was modified and why.

The three-phase design separates objective data (Phase 1), diagnostic judgment (Phase 2), and targeted fixes (Phase 3). This separation matters: combining them would leave the optimization without a clear baseline to measure against, and the diagnostic would collapse into the fix rather than informing it.
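The three phases can be sketched as a small orchestration function. `replay`, `classify`, and `revise` stand in for the real steps, which are not specified here; the structure — baseline first, then classification, then parallel per-property revision — is the part taken from the pipeline described above:

```python
# Sketch of the three-phase refinement pipeline under assumed step interfaces.
from concurrent.futures import ThreadPoolExecutor

def refine_schema(schema: dict, sample_content: list, replay, classify, revise) -> dict:
    # Phase 1: replay extraction against the problem sample to establish a
    # per-property baseline.
    baseline = {name: replay(spec, sample_content) for name, spec in schema.items()}

    # Phase 2: classify each property; keep only the underperformers.
    verdicts = {name: classify(results) for name, results in baseline.items()}
    flagged = [n for n, v in verdicts.items()
               if v in ("missed", "low_confidence", "inaccurate")]

    # Phase 3: regenerate flagged definitions in parallel so total latency
    # stays bounded regardless of how many properties need revision.
    with ThreadPoolExecutor() as pool:
        revised = dict(zip(flagged, pool.map(
            lambda n: revise(schema[n], verdicts[n]), flagged)))
    return {**schema, **revised}
```

Properties classified as "unavailable" or "extracted correctly" never reach Phase 3, which is what keeps the optimization targeted rather than schema-wide.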

The Schema as Monitoring Infrastructure

The four-layer architecture describes schema lifecycle as Layer 4. It's positioned last because it depends on the other three layers to generate the feedback it needs:

  • Layer 1 (dual memory store) produces the extraction results that evaluation scores
  • Layer 2 (governance routing) delivers the governance context whose helpfulness gets rated
  • Layer 3 (governed retrieval) retrieves the memories whose usage flags get logged

Schema lifecycle closes the loop that the other three layers open.

Without Layer 4, you have a memory system that runs reliably until it doesn't, with no mechanism to detect the transition. With it, you have a system that produces ongoing signals about where quality is holding and where it's drifting — before the drift affects downstream systems.

The metrics that matter for ongoing monitoring:

  • Per-property extraction rate — what fraction of processed content produces a non-null value for each property
  • Per-property confidence distribution — are confidence scores shifting over time?
  • Memory usage rate in outputs — are retrieved memories appearing in agent outputs, or being ignored?
  • Governance helpfulness ratings — are governance variables rated as useful or noise?
  • Low-score frequency by criterion — which rubric dimensions degrade first when quality slips?
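The first of those metrics is cheap to compute from extraction results alone. A sketch, assuming each result is a dict mapping property names to extracted values (with null/`None` for misses):

```python
# Per-property extraction rate over a batch of extraction results.
# Each result is a dict of property name -> extracted value or None.
def extraction_rates(results: list[dict]) -> dict[str, float]:
    if not results:
        return {}
    props = {p for r in results for p in r}
    return {
        p: sum(1 for r in results if r.get(p) is not None) / len(results)
        for p in props
    }
```

Tracked over time, a falling rate for one property points at a stale description or a new content type; falling rates across many properties point at a model update or a schema-wide mismatch.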

These aren't one-time measurements. They're the instrumentation of a production memory system. Schema lifecycle management is how you keep the signal clean as the world around the schema changes.


Frequently Asked Questions

How often should automated refinement run? Refinement is triggered by evaluation signals, not on a fixed schedule. When per-property accuracy falls below a threshold, or when a new content type produces high null rates across multiple properties, refinement runs on the affected properties. In production, this might be weekly for active schemas on high-volume workflows, or event-triggered on schema changes.
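The trigger logic this describes reduces to a threshold check. Both thresholds below are illustrative placeholders, not recommended values:

```python
# Hypothetical signal-based trigger: refine when per-property extraction
# quality drops below a floor, or a new content type produces high null
# rates. Threshold values are illustrative, not recommendations.
def should_refine(extraction_rate: float, null_rate_new_content: float,
                  rate_floor: float = 0.6, null_ceiling: float = 0.5) -> bool:
    return extraction_rate < rate_floor or null_rate_new_content > null_ceiling
```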

Does refinement overwrite existing extracted values? No. Refinement updates the schema definition, which affects future extractions. Existing values remain in the store. If you want to backfill historical extractions with the improved schema, extraction replay can be run against historical content — producing updated values that coexist with originals until a migration is complete.

What if the automated refinement makes things worse? Phase 1 establishes a baseline for each property. After Phase 3 produces revised definitions, the organization can validate on a holdout sample before deploying. The change annotations from Phase 3 make it clear what changed and why, enabling human review before deployment if the organization's risk tolerance requires it.

How do custom rubrics work? Custom rubrics define criteria specific to the organization's use case, each with a weight contributing to the 100-point total. A legal review rubric might weight compliance accuracy at 40 points. A customer success rubric might weight relationship continuity at 30 points. The rubric is defined once and applied consistently across all scored interactions, enabling trend analysis across time.