Schema-Guided Extraction: How Your Properties Teach the LLM What to Look For

Every property in your schema carries a type, a description, extraction instructions, examples, and measurement criteria. Together they form a per-property system prompt the LLM consumes at extraction time. You stop hand-prompting; the schema does it. Here is what an extraction-ready property actually looks like, and why this is the only way schema-based memory scales past a handful of fields.

The thing nobody warns you about with structured memory

The first time you build typed memory extraction, you write the prompt by hand. You list the properties you want filled. You explain what each one means. You give an example or two. The LLM produces something, you tune the wording, you ship it.

The first ten properties are easy. By property twenty the prompt is bloated. By property thirty it has contradictions you did not mean to write. By property fifty it does not fit in the context window of any model you would actually pay for. The team that wants to add a new property argues with the team that already has a working extraction for the old ones, because the prompt is shared.

This is what every team building schema-enforced memory hits, and it is what kills most projects before they ship past their first vertical. The fix is not a better prompt-engineering technique. The fix is to stop writing the extraction prompt by hand and let the schema carry it.

The schema is the prompt. Each property carries its own extraction guidance. The system composes the LLM call from the schema definitions every time, for the properties the content actually needs.

This article is about what a schema-guided property actually looks like inside Personize, why each field carries weight, and how scaling extraction to fifty or a hundred properties stays sane when the schema does the work.

What a property looks like, in full

Most schema systems treat a property as a name and a type. That is enough for a CRM column. It is not enough for an LLM to extract reliably into.

In Personize, a property carries five pieces of information (a name and type, a description, extraction instructions, examples, and measurement criteria) that together define an extraction contract:

{
  "name": "pain_points",
  "type": "array",
  "description": "Specific operational, technical, or business challenges this contact has mentioned facing, in their own words or close paraphrase. Used by AEs to prepare relevant solution framing.",
  "extractionInstructions": "Extract only challenges the contact themselves raised, not those mentioned by their colleagues unless the contact explicitly endorsed them. Combine duplicate mentions. Use the form 'short noun phrase' (e.g., 'API rate limits' not 'they are hitting API rate limits'). Drop generic complaints not tied to a specific workflow or system.",
  "examples": [
    { "input": "We keep hitting our API ceiling during the daily reconciliation run.", "output": ["API rate limits during daily reconciliation"] },
    { "input": "She mentioned vendor consolidation pressure from procurement.", "output": ["vendor consolidation pressure"] },
    { "input": "The team is frustrated with our current providers.", "output": [] }
  ],
  "measurementCriteria": "An extraction is correct when each entry is a noun phrase, attributed to a challenge this specific contact raised, and not present multiple times under different wording."
}

Each field does a different job.

name is the database column. The downstream CRM, the workflow that filters on it, and the dashboard that aggregates it all use this name.

type is the validation contract. The extractor must produce an array of strings. The validator rejects anything else. The downstream system never has to handle malformed input.

description is what the LLM reads first. It explains, in plain language, what this property is for. Not what it is named, but what it means in this organization. "Pain points" means specific operational challenges the contact themselves raised. It does not mean general industry complaints. It does not mean things the contact's colleagues said. The description does the heavy lifting of disambiguating one property from another in the LLM's view.

extractionInstructions is where you put the rules the LLM keeps getting wrong if you do not say them. Combine duplicates. Use noun phrases. Drop generic complaints. These are the things you learned from the first hundred extractions where the model produced almost-right output that you wanted to nudge in a specific direction. Every team has these. Without a place to write them per-property, they end up scattered across a global prompt that nobody dares to edit.

examples is what calibrates the model. Three examples are usually enough: one straightforward extraction, one that applies a transformation rule such as paraphrasing, and one negative case with nothing to extract. They live with the property, not in a global few-shot block, so adding a new property does not perturb the calibration of the existing ones.

measurementCriteria is what the evaluation harness and the AI-as-judge runs use. It is also what the schema author returns to when arguing about whether the extractor is working correctly. Without an explicit criterion, "is this right?" becomes a vibe check.
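To make the contract concrete, here is a minimal Python sketch of a card and the type check it implies. The `PropertyCard` class and `validate_value` function are hypothetical names for illustration, not Personize's API; the sketch assumes "array" means a list of non-empty strings, and it only catches exact duplicates, not rewordings.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class PropertyCard:
    """One schema property; fields mirror the JSON card shown earlier."""
    name: str
    type: str
    description: str
    extraction_instructions: str
    examples: list[dict[str, Any]] = field(default_factory=list)
    measurement_criteria: str = ""


def validate_value(card: PropertyCard, value: Any) -> list[str]:
    """Return a list of problems; an empty list means the value passes."""
    if card.type != "array":
        # Assumption: only the array type is sketched here.
        return [f"{card.name}: no validator for type {card.type!r}"]
    if not isinstance(value, list):
        return [f"{card.name}: expected a list, got {type(value).__name__}"]

    problems: list[str] = []
    for i, item in enumerate(value):
        if not isinstance(item, str) or not item.strip():
            problems.append(f"{card.name}[{i}]: expected a non-empty string")

    # Exact repeats (ignoring case and surrounding whitespace) count as duplicates.
    normalized = [item.strip().lower() for item in value if isinstance(item, str)]
    if len(normalized) != len(set(normalized)):
        problems.append(f"{card.name}: duplicate entries")
    return problems
```

A value that fails this check never reaches the downstream system; whether a surviving entry is actually a noun phrase attributed to the right contact is a question for the measurement criteria, not the type check.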

[Figure: Schema-guided extraction. One property card with five contract pieces (name + type, description, extraction instructions, examples, measurement criteria) compiles into the extraction LLM's system prompt, which is composed from 30-50 such cards selected by content similarity per call and produces typed values across records.]

How the LLM sees a property at extraction time

The runtime composes the system prompt from the schema. For an extraction call against a piece of content, the system selects which properties are likely relevant (by embedding similarity between the content and the property descriptions), then includes the full property card (name, type, description, instructions, examples) for each selected property.

The model sees something like this for each selected property:

### pain_points (type: array)

Specific operational, technical, or business challenges this contact has
mentioned facing, in their own words or close paraphrase. Used by AEs
to prepare relevant solution framing.

Extraction rules:
Extract only challenges the contact themselves raised, not those mentioned
by their colleagues unless the contact explicitly endorsed them. Combine
duplicate mentions. Use the form 'short noun phrase' (e.g., 'API rate
limits' not 'they are hitting API rate limits'). Drop generic complaints
not tied to a specific workflow or system.

Examples:
- Input: "We keep hitting our API ceiling during the daily reconciliation run."
  Output: ["API rate limits during daily reconciliation"]
- Input: "She mentioned vendor consolidation pressure from procurement."
  Output: ["vendor consolidation pressure"]
- Input: "The team is frustrated with our current providers."
  Output: []

That is one property. The system prompt contains up to fifteen to fifty of these cards, depending on tier. The LLM produces a structured output with one entry per included property, validated by the type definition.
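Rendering a card into that text is mostly string formatting. The following is a minimal sketch, assuming each card is a plain dict with the JSON keys shown earlier; `render_card` and `compose_system_prompt` are hypothetical names, and the real runtime may lay the card out differently.

```python
import json
import textwrap


def render_card(card: dict, width: int = 76) -> str:
    """Render one property card in the layout shown above."""
    lines = [f"### {card['name']} (type: {card['type']})", ""]
    lines += textwrap.wrap(card["description"], width=width)
    lines += ["", "Extraction rules:"]
    lines += textwrap.wrap(card["extractionInstructions"], width=width)
    lines += ["", "Examples:"]
    for example in card.get("examples", []):
        # json.dumps gives quoted strings and bracketed lists, as in the card.
        lines.append(f"- Input: {json.dumps(example['input'], ensure_ascii=False)}")
        lines.append(f"  Output: {json.dumps(example['output'], ensure_ascii=False)}")
    return "\n".join(lines)


def compose_system_prompt(cards: list[dict]) -> str:
    """Join the selected cards into one prompt body, one section per card."""
    return "\n\n".join(render_card(card) for card in cards)
```

Because each section is built only from its own card, editing one card changes one section of the composed prompt and nothing else.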

The model is not asked to figure out what pain points are. It is told. The model is not asked to figure out what shape the answer should be. It is shown. The model's job is to read the content and apply the property card. The prompt engineering work that used to live in one shared system prompt now lives inside each property, owned by the team that owns the property.

Why this scales past the wall

The reason teams hit a wall around twenty hand-prompted properties is that the prompt becomes one shared resource that everyone is editing and nobody can confidently change. Add a sentence to clarify property A's extraction; suddenly property B's accuracy drops because the wording sounded contradictory to the model. The team owning B and the team owning A now have to coordinate on the prompt. Multiply this across thirty more properties.

Schema-guided extraction breaks the contention. The team owning pain_points owns the description, the instructions, the examples, and the measurement criteria for pain_points. The team owning buying_stage owns those for buying_stage. The system prompt at extraction time is composed from the property cards selected for this content. No team can break another team's property by editing their own.

The selection mechanism also reduces the prompt size for any given call. Not every content body needs all fifty properties evaluated. The embedding-based selector matches content against property descriptions and includes only the relevant ones, capped by the tier:

| Tier | Max properties | Min similarity threshold | Chunk size |
|---|---|---|---|
| `basic` | 15 | 0.4 | 2,000 words |
| `pro` | 30 | 0.3 | 3,000 words |
| `pro_fast` | 20 | 0.3 | 2,000 words |
| `ultra` | 50 | 0.2 | 4,000 words |

A pro-tier extraction against a sales call transcript pulls in up to thirty of the most relevant properties by content similarity (those that clear the 0.3 threshold), builds the system prompt from those property cards, and runs the extraction. A property that is not relevant to this content (an e2e_test_status property, for an extraction against a discovery call transcript) never enters the prompt. Its absence does not perturb the calibration of the properties that did enter.
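The selection step can be sketched in a few lines of Python. This is an illustration under assumptions, not the actual selector: it takes precomputed embedding vectors as plain lists, uses cosine similarity as the similarity measure, and leaves out the force-included universal properties. `select_properties` is a hypothetical name; the caps and thresholds come from the tier table.

```python
import math

# Caps and thresholds copied from the tier table.
TIERS = {
    "basic": {"max_properties": 15, "min_similarity": 0.4},
    "pro": {"max_properties": 30, "min_similarity": 0.3},
    "pro_fast": {"max_properties": 20, "min_similarity": 0.3},
    "ultra": {"max_properties": 50, "min_similarity": 0.2},
}


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_properties(
    content_vec: list[float],
    property_vecs: dict[str, list[float]],
    tier: str,
) -> list[str]:
    """Rank properties by similarity, drop those under the threshold, cap by tier."""
    limits = TIERS[tier]
    scored = [(cosine(content_vec, vec), name) for name, vec in property_vecs.items()]
    # Highest similarity first; ties broken by name for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    relevant = [name for score, name in scored if score >= limits["min_similarity"]]
    return relevant[: limits["max_properties"]]
```

With toy two-dimensional vectors, a property pointing the same way as the content is selected and an orthogonal one is dropped, whatever the tier. Real embeddings have hundreds of dimensions, but the filtering logic is the same.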

A schema with a hundred and twenty properties is workable, because any individual call sees at most fifty. Adding the hundred-and-twenty-first property does not affect the cost or accuracy of any existing extraction except calls whose content makes the new property relevant.

What the tier choice actually buys you

The four tiers are not a quality knob. They are a cost-versus-coverage knob.

basic is for high-volume, cost-sensitive extraction. Short content. Schemas with fewer than twenty properties. No PII redaction (redaction is bundled into pro and above). Use it for ingestion of large CSVs where you know the data is clean and the property set is small.

pro is the default. Balanced. Thirty max properties, three-thousand-word chunks, full PII redaction, force-includes for the universal properties (full_name, email, company_name, job_title). This is the right tier for almost all production extraction work.

pro_fast is pro with smaller chunks and, per the tier table, a lower property cap (twenty rather than thirty). The chunking is tighter, which speeds up extraction on long documents at a small accuracy cost on context-dependent fields. Use it when latency matters and the content is long.

ultra is the heavy lifter. Fifty properties, four-thousand-word chunks, the lowest similarity threshold (which means properties with even loose relevance get included), and LLM reasoning enabled. Reasoning means the model can spend more tokens on inference before committing to an extraction, which is useful for properties whose value requires synthesizing across the document instead of pulling from a single sentence. It costs about three times as much as pro.
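To see what the chunk sizes mean for a long document, here is a rough Python sketch. It assumes chunks are non-overlapping runs of whitespace-separated words; a production chunker may well respect sentence boundaries or overlap chunks, and `chunk_by_words` is a hypothetical helper.

```python
# Chunk sizes in words, copied from the tier table.
CHUNK_WORDS = {"basic": 2000, "pro": 3000, "pro_fast": 2000, "ultra": 4000}


def chunk_by_words(text: str, tier: str) -> list[str]:
    """Split text into consecutive chunks of at most the tier's word budget."""
    size = CHUNK_WORDS[tier]
    words = text.split()
    return [" ".join(words[i : i + size]) for i in range(0, len(words), size)]
```

Under these assumptions a 7,000-word transcript becomes three chunks on pro, four on pro_fast, and two on ultra. Smaller chunks mean each call reads less context, which is where the accuracy cost on context-dependent fields comes from.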

The mistake to avoid is bumping the tier as a first response to bad output. The first thing to fix when extraction is wrong is the property card. Ninety percent of "ultra would work better" cases are actually "the description was ambiguous" or "the examples did not cover the case" or "the instructions had a contradiction." Tier escalation buys you reasoning depth. It does not buy you schema clarity.

The economics

The economic argument for schema-guided extraction shows up at month two, not month one. In month one, you could write a per-vertical extraction prompt by hand for each use case. The prompt would be tuned. The extraction would work. The team would feel like they had a system.

In month two, a new use case arrives. The new use case has overlap with the existing properties (twenty of the thirty you already have apply) and ten new properties. Now the question is whether to write a new global prompt, edit the existing one, or build a per-use-case selection layer that swaps properties in and out. Each option has a cost. None are cheap.

In month six, with fifteen use cases live, hand-prompting is no longer recoverable. The team is at the wall.

Schema-guided extraction pays its own engineering cost up front. Building the property card format, the selection mechanism, the validator, and the calibration harness is real work. The payoff is that every property added after that work is cheap. The team owning competitor_count writes the card, ships it, and every existing extraction that ranks the new property as relevant starts producing it. No prompt engineering. No coordination. No risk of perturbing existing extractions.

For a system that is going to extract fifty or a hundred properties over its lifetime, this is the only architecture that holds together. The schema is your highest-leverage piece of prompt engineering, a system prompt that scales without becoming a tragedy of the commons.

Where it produces dual-modality memory

A consequence worth flagging: schema-guided extraction produces typed properties and atomic memories from the same LLM call.

The system prompt includes the property cards (for typed extraction) and the open-set extraction directive (for atomic memories). The model produces both outputs from the same content read. The atomic memories capture facts the schema did not anticipate: a hiring signal, a competitor mention, a casual reference to next quarter's priorities. The typed properties capture what the schema asked for. Both come from one pass, one cost, one LLM call.
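On the receiving side, one response has to be split into two storage shapes. The sketch below assumes a hypothetical response format, a JSON object with a `properties` map and a `memories` list; the actual payload is not specified here, and `split_extraction` is an illustrative name.

```python
import json


def split_extraction(raw_response: str, selected: list[str]) -> tuple[dict, list[str]]:
    """Split one LLM response into typed properties and atomic memories.

    Assumed (hypothetical) shape:
    {"properties": {name: value, ...}, "memories": ["fact", ...]}
    """
    payload = json.loads(raw_response)
    returned = payload.get("properties", {})
    # Keep only the properties that were in the prompt; missing ones become None.
    properties = {name: returned.get(name) for name in selected}
    # Atomic memories are free-text facts; drop blanks and non-strings.
    memories = [
        m.strip()
        for m in payload.get("memories", [])
        if isinstance(m, str) and m.strip()
    ]
    return properties, memories
```

The typed half then goes through the per-property validator, while the memories are stored as they are. The content was read once, so the LLM cost was paid once.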

This is the dual memory architecture at the extraction layer. The measured coverage from a controlled experiment: 38% of valuable information is captured only by open-set extraction, 12% only by schema-enforced, 34% by both, 16% missed by both. Combined recall reaches 82.8%. Neither modality alone gets there.

The fact that both modalities share the LLM call is what makes the cost work. You pay once for extraction; you receive two storage shapes; both fold into the unified retrieve as separate sources. The schema's job is to make the typed extraction reliable. The open-set extractor's job is to catch what the schema did not anticipate. Together they hit a recall ceiling neither could reach alone, at the cost of one LLM call per content body.

The principle

Stop writing extraction prompts by hand. Put the prompt-engineering work inside each property where it belongs.

The property is the unit. The team owning it owns its description, instructions, examples, and measurement criteria. The system composes the extraction prompt from the property cards selected per content. The schema scales because no two properties contend for the same shared prompt block.

When extraction goes wrong, the diagnosis is local. Look at the property card. Check the description for ambiguity. Check the instructions for contradictions. Check the examples for coverage gaps. Fix the property; the extraction improves for that property only; the rest of the schema is undisturbed.
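Part of that local check can be automated. The following Python sketch lints a card for structural gaps only: empty fields and missing positive or negative examples. It cannot detect an ambiguous description or contradictory instructions, which still need a human read. `lint_card` is a hypothetical helper operating on the JSON card shape shown earlier.

```python
def lint_card(card: dict) -> list[str]:
    """Return structural warnings for one property card; empty means none found."""
    warnings: list[str] = []
    for key in ("description", "extractionInstructions", "measurementCriteria"):
        if not str(card.get(key, "")).strip():
            warnings.append(f"missing {key}")

    examples = card.get("examples", [])
    if not examples:
        warnings.append("no examples")
        return warnings

    # A positive example shows what to extract; a negative one shows when to
    # return nothing. For array properties the negative case is an empty list.
    if not any(example.get("output") for example in examples):
        warnings.append("no positive example")
    if card.get("type") == "array" and not any(
        example.get("output") == [] for example in examples
    ):
        warnings.append("no negative example (empty output)")
    return warnings
```

Running a check like this when a card is saved catches the cheap mistakes before the first extraction and leaves the judgment calls for review.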

This is what makes hundred-property schemas workable. It is what lets agents memorize at scale without the prompt becoming the bottleneck. It is the architectural decision that turns memory extraction from a science project into a system.

If you are building schema-enforced memory and your current pattern is a hand-written global prompt, audit one thing first. How many properties is it currently extracting, and how many more is your roadmap going to add this year? If the answer to the second question is more than ten, the wall is in front of you. Move the prompt engineering into the schema before you hit it.

Companion pieces: "Three Modalities, One Save Surface" on the broader save surface that consumes this schema. "Dual Memory" on the coverage numbers behind dual-modality extraction. "One Call, Four Sources" on the retrieval payload where typed properties appear alongside atomic memories.