Raw content is what was said. Extracted memory is what was meant. Guided extraction is what makes the difference.
When we built our memory system, we focused on what every serious memory team focuses on: the benchmarks. Atomic fact accuracy. Coreference resolution. Temporal reasoning. Multi-step recall. These are the metrics the research community treats as the scoreboard, and we optimized hard for all of them. We delivered best-in-class performance in memorization accuracy, recall, governance delivery, and token efficiency. The system worked, and it worked well.
But working across different companies and industries, I started noticing a pattern. A sales intelligence team using company profiles. A recruiting platform working with candidate records. A customer success tool tracking account histories. The memory pipeline was accurate. The recall was precise. The quality was strong across the board. And still, I could see a gap worth closing.
Different entity types require memorizing fundamentally different things. A Company record and a Contact record extracted from the same sales call should produce different memories because different facts matter for each. And it goes deeper: the same entity type in different industries needs different extraction priorities. A "Company" in construction cares about fleet size and job site coverage. A "Company" in SaaS cares about ARR and integration partnerships.
You can score perfectly on atomic accuracy and coreference resolution and still leave high-value facts on the table. The standard metrics measure whether you remembered correctly. They don't measure whether you remembered the right things. That's where the meaningful improvement lives: extending expert reasoning and domain expertise into the extraction itself.
Not just "extract facts from this text," but "extract the facts that matter for this entity type, in this industry, for this use case." We call it guided extraction, and the difference is dramatic.
Here's one example. Take company research: scraped website content, a press release, product pages. With generic extraction ("extract memories from this text"), the system returns a handful of memories. Most are people's names and job titles. The rest are surface-level descriptions no more useful than the original content.
Now change one thing: tell the extraction system the entity type. "This is a Company record. Read it like a competitive intelligence analyst." Same content, same model, same pipeline. The system now extracts memories covering products, pricing tiers, target industries, named customers, executive quotes, competitive positioning, expansion strategy, and integration partnerships. Not just more facts, but fundamentally different facts. The kind an agent can actually act on.
The rest of this article unpacks both sides: why extracted memories are fundamentally superior to raw content for AI agents, and why guided extraction is what turns good memory into great memory.
Part I: Why Memories Over Raw Content
Most AI systems store content and retrieve it later. Ingest documents, chunk them, embed them, return the top-k most similar chunks at query time. For document search, this works well. For agents that need to act on accumulated knowledge, it breaks down in seven specific ways.
1. Retrieval Precision
Raw content retrieves by chunk similarity. You query "what industries does this company serve?" and get a 500-word block that's vaguely related. Somewhere in those 500 words is the answer, buried among navigation menus, marketing copy, and unrelated descriptions.
Extracted memories retrieve by fact similarity. The same query returns: "Acme Corp serves construction, oil and gas, inspection, maintenance, and transportation industries." Directly usable. No sifting.
Research on retrieval-augmented generation consistently shows that precision of retrieved context directly correlates with output quality. Noisy context degrades LLM performance even when the relevant fact is present. The model doesn't just ignore irrelevant content -- it actively hedges, mixes up entities, and loses confidence in facts it would otherwise state clearly.
The needle-in-a-haystack problem isn't just about finding the needle. The hay itself makes the model worse.
2. Cross-Document Synthesis
Raw content is trapped in its source document. If you scraped five pages about a company, you have five separate chunks stored independently. An agent asked "who are their customers and where are their offices?" might find customers on page 3 but miss office locations on page 5 because the embeddings aren't similar enough to surface both.
Extracted memories are source-independent. "BuildRight Construction is a customer of Acme Corp" and "Acme Corp has offices in Calgary and Houston" exist as independent facts in the same memory store. They were extracted from different pages, but at retrieval time, they're equal citizens. An agent composes them naturally: "They already have construction clients like BuildRight, and with their new Houston office, they're clearly investing in the US market."
This is knowledge distillation -- converting unstructured content into a structured knowledge base that supports compositional reasoning. The power isn't in any individual fact. It's in the ability to combine facts from sources that never knew about each other.
3. Token Efficiency
Consider an AI agent preparing for a sales call:
Raw content approach: Retrieve 5 relevant chunks. Total context: roughly 2,500 tokens. Useful information: maybe 200 tokens. That's 92% waste -- navigation menus, markdown formatting, sentences that happen to share embedding similarity with the query but contain nothing actionable.
Memory approach: Retrieve 15 relevant memories. Total context: roughly 300 tokens. Every token is a usable fact.
At scale, an agent with 20 context items from raw chunks burns 10,000+ tokens. The same agent with 20 memories uses 600 tokens. That's 16x more efficient -- cheaper API calls, faster responses, and room for more facts within the same context window.
But efficiency isn't just about cost. LLMs perform better with less, higher-quality context than with more, noisier context. The Self-RAG paper (2024) demonstrated that retrieved context quality matters more than quantity across multiple benchmarks. You're not just saving tokens. You're making the model smarter by giving it less to reason about.
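The arithmetic above is easy to sanity-check. A minimal sketch, using the illustrative per-item sizes implied by the text (hypothetical averages, not measured values):

```python
# Back-of-the-envelope context cost: raw chunks vs. extracted memories.
# The per-item token counts are illustrative assumptions from the text.

CHUNK_TOKENS = 500    # a typical raw chunk, mostly boilerplate
MEMORY_TOKENS = 30    # a typical extracted fact
ITEMS = 20            # context items the agent retrieves either way

raw_cost = ITEMS * CHUNK_TOKENS      # 10,000 tokens of mostly noise
memory_cost = ITEMS * MEMORY_TOKENS  # 600 tokens of usable facts

print(f"raw chunks: {raw_cost} tokens")
print(f"memories:   {memory_cost} tokens")
print(f"savings:    {raw_cost / memory_cost:.1f}x")
```

The exact ratio depends entirely on the chunking strategy and how terse the extracted facts are; the point is that the gap is an order of magnitude, not a rounding error.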
4. Temporal Coherence
Raw content is a frozen snapshot. If you scraped a pricing page in January and again in April, you have two overlapping chunks with conflicting information. The LLM sees both and doesn't know which is current. It hedges, or worse, confidently states the wrong number.
Extracted memories with timestamps can be deduplicated and superseded. The system knows that "Starter plan: $49/month (January)" was replaced by "Starter plan: $59/month (April)." At retrieval time, only the current fact surfaces. The agent states the correct price without hedging.
This extends beyond simple value changes. People change roles. Companies pivot their positioning. Deals progress through stages. Budgets get revised. Raw chunks from different conversations coexist in the store with no mechanism to reason about which truth is current. Extracted, timestamped memories make temporal reasoning possible.
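Supersession of timestamped facts can be sketched in a few lines. This is a minimal illustration, assuming each memory carries a subject key and an observation date (the field names are hypothetical, not a real API):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Memory:
    subject: str       # what the fact is about, e.g. "starter_plan_price"
    fact: str
    observed_at: date

def supersede(memories):
    """Keep only the most recently observed fact per subject."""
    current = {}
    for m in sorted(memories, key=lambda m: m.observed_at):
        current[m.subject] = m  # later observations overwrite earlier ones
    return list(current.values())

store = [
    Memory("starter_plan_price", "Starter plan: $49/month", date(2024, 1, 15)),
    Memory("starter_plan_price", "Starter plan: $59/month", date(2024, 4, 2)),
    Memory("hq_location", "HQ is in Calgary", date(2024, 2, 10)),
]

for m in supersede(store):
    print(m.fact)  # only the April price survives; the January one is superseded
```

Raw chunks offer no equivalent hook: two scraped pricing pages have no shared subject key, so there is nothing to overwrite.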
5. Multi-Modal Composition
An agent writing a personalized email needs facts from different sources: company research from a website, relationship history from a CRM, recent interactions from emails, and market context from news. Raw content from these sources has completely different formats. A scraped webpage is HTML with navigation menus. A CRM note is three sentences. An email is a conversation thread. A news article is structured prose.
Extracted memories normalize everything into the same format. A memory from a website looks exactly the same as a memory from an email looks exactly the same as a memory from a CRM note: a self-contained fact with keywords, entities, a topic label, and a timestamp.
This normalization happens at write time, not query time. By the time an agent retrieves facts from three different sources, they're composable because they're structurally identical. Without extraction, the agent receives an HTML page, a three-line note, and an email thread -- three fundamentally different documents requiring three different parsing strategies before any reasoning can begin.
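The normalized shape can be made concrete. A minimal sketch of the schema the text describes (a self-contained fact with keywords, entities, a topic label, and a timestamp); the exact field names are assumptions:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Memory:
    """Normalized memory: the same shape regardless of source format."""
    fact: str
    topic: str
    entities: list
    keywords: list
    timestamp: datetime
    source: str  # provenance only; does not change the shape

# Two sources, two original formats -- one shape after extraction.
from_website = Memory(
    fact="Acme Corp serves construction and oil and gas industries",
    topic="target_industries",
    entities=["Acme Corp"],
    keywords=["industries", "construction", "oil and gas"],
    timestamp=datetime(2024, 3, 1),
    source="website",
)
from_crm = Memory(
    fact="Contact prefers email over phone calls",
    topic="communication_style",
    entities=["Jane Doe"],
    keywords=["email", "communication"],
    timestamp=datetime(2024, 3, 5),
    source="crm_note",
)

# Structurally identical, so composing them needs no per-source parsing.
context = "\n".join(m.fact for m in [from_website, from_crm])
print(context)
```

The `source` field preserves provenance without leaking format differences into the reasoning layer.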
6. Controllable Recall
With raw content, you can only filter by vector similarity. The query "competitive positioning" returns whatever chunks happen to have high cosine similarity with that phrase -- which might include an About page, a product comparison, or an unrelated paragraph that mentions the word "competitive" in a different context.
With extracted memories, you can filter by structured metadata:
- By topic: Only `competitive_positioning` and `value_proposition` facts.
- By entity: Everything about a specific company.
- By person: What do we know about a specific executive?
- By time: What changed in the last quarter?
- By confidence: Only high-confidence, directly stated facts.
"Get me their competitive positioning and any executive quotes" returns exactly two or three facts, not five loosely related paragraphs. This surgical precision is an architectural capability that raw content fundamentally cannot provide. You can filter chunks by source or date, but you can't filter by what the content means without first extracting what it means.
7. The Research Basis
The case for structured extraction over raw retrieval has been building across multiple research directions:
Knowledge Graphs vs. Vector RAG. Papers from 2023-2025 show structured knowledge representations outperforming raw-chunk RAG on multi-hop reasoning tasks by 15-30%. Microsoft's GraphRAG demonstrated that entity-relationship structures enable reasoning that flat retrieval cannot.
MemGPT and MemoryBank (2023-2024). LLM agents with structured memory stores outperform agents with raw conversation history on long-horizon tasks. Summarization into atomic facts enables unbounded memory, while raw content hits context limits.
RAPTOR (2024). Hierarchical summarization improves retrieval quality over flat chunking by 20%+ on QA benchmarks.
Self-RAG (2024). Retrieved context quality matters more than quantity. Fewer, higher-quality passages outperform more passages across benchmarks.
The convergence points to the same conclusion: the write-time decision of what to store determines the ceiling of what the system can retrieve.
Part II: Why Guided Extraction Changes Everything
The seven reasons above make the case for extraction over raw storage. But as I described in the opening, accurate extraction alone isn't enough. The next question is: what should the extraction system look for?
The default answer is "everything." Extract all facts. Be comprehensive. The more you extract, the more you'll have to work with later. This sounds reasonable and is completely wrong.
Generic Extraction is the Intern Taking Notes
When you tell an LLM "extract facts from this text," you're asking an intern to take notes at a board meeting. They'll capture what's obvious: names, titles, dates, numbers. They'll miss what matters: the strategic subtext of the CEO's remarks, the competitive implication of a product pivot, the fact that three specific customer logos on a website signal enterprise credibility.
Generic extraction produces generic memories. A company website processed with no domain guidance yields: "John Smith is the CEO." "The company was founded in 2018." "They have offices in Austin." True facts. Useless for an agent trying to craft a compelling outreach or assess competitive positioning.
Expert Extraction is the Analyst Briefing You
Guided extraction changes the frame. Instead of "extract facts," the system is told: "You are analyzing a Company record. Adopt the perspective of a competitive intelligence analyst. Extract value propositions, competitive positioning, executive quotes revealing strategic intent, named case studies, customer proof points, and buyer-relevant details."
Same content. Radically different output. The extraction now captures: "The CEO stated the company is positioning itself at the heart of the industries it serves, aiming to help clients recover revenue lost to inefficiencies." That's a value proposition an agent can use in outreach. The generic extraction didn't even notice it was there.
This isn't prompt engineering in the conventional sense. It's a design decision about what counts as knowledge for a given entity type, and it changes the entire output.
Entity-Type Awareness
The most impactful form of guided extraction is entity-type awareness -- telling the extraction system what kind of record it's building a profile for.
When the system knows it's analyzing a Company, it looks for: products, features, pricing tiers, target industries, named customers, integrations, competitive positioning, executive leadership, geographic footprint, funding history, and strategic direction.
When it's analyzing a Contact, it looks for: role, responsibilities, pain points, budget authority, decision timeline, technology preferences, communication style, and relationship context.
When it's analyzing a Pitch Deck, it looks for: market size, traction metrics, team composition, ask amount, competitive landscape, unit economics, and customer acquisition strategy.
The entity type acts as a lens that focuses the LLM's attention on what matters. Without it, the LLM defaults to the safest, most obvious facts -- names and titles -- because it has no signal about what downstream agents will need.
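The lens can be implemented as a small lookup from entity type to extraction guidance. A sketch under stated assumptions: the guideline lists mirror the ones above, but the prompt wording and function names are illustrative, not the production prompt:

```python
# Entity type -> (analyst persona, facts to prioritize). Illustrative only.
EXTRACTION_GUIDES = {
    "Company": ("competitive intelligence analyst",
                ["products", "pricing tiers", "target industries",
                 "named customers", "competitive positioning",
                 "executive quotes revealing strategic intent"]),
    "Contact": ("account executive",
                ["role", "pain points", "budget authority",
                 "decision timeline", "communication style"]),
}

def build_extraction_prompt(entity_type: str, content: str) -> str:
    """Assemble an entity-type-aware extraction prompt."""
    persona, targets = EXTRACTION_GUIDES.get(
        entity_type, ("careful note-taker", ["all stated facts"]))
    return (f"You are analyzing a {entity_type} record. "
            f"Adopt the perspective of a {persona}. "
            f"Extract facts about: {', '.join(targets)}.\n\n"
            f"Content:\n{content}")

prompt = build_extraction_prompt("Company", "Acme Corp press release ...")
print(prompt.splitlines()[0])
```

The fallback persona makes the failure mode explicit: an unknown entity type degrades gracefully to generic extraction rather than erroring, which is exactly the behavior the rest of this section argues you want to avoid relying on.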
This aligns with recent findings in information extraction research. GoLLIE (Sainz et al., EMNLP 2023) demonstrated that providing annotation guidelines -- essentially telling the model what to look for and how -- significantly improves zero-shot extraction quality. InstructUIE (Wang et al., 2023) showed that task-specific instructions enable a single model to outperform domain-specific fine-tuned extractors across 32 datasets. The principle is the same: guided extraction consistently beats generic extraction, whether the guidance comes from annotation schemas or entity-type-aware prompting.
We tested this across multiple entity types and content formats. In every case, entity-type-aware extraction produced memories that were more diverse in topic coverage, more specific in detail, and more useful for downstream agent tasks than generic extraction of the same content.
Strategic Depth: Beyond Surface Facts
Even with entity-type awareness, extraction can still be shallow. A company website says "We help companies work smarter, invoice faster, and grow revenue." A surface-level extractor captures the sentence. A strategically aware extractor recognizes this as a value proposition and tags it accordingly.
The distinction matters because AI agents don't just need facts -- they need the kind of facts that enable action. An agent writing an outreach email needs value propositions. An agent preparing a competitive analysis needs positioning statements. An agent qualifying a lead needs buying signals. If the extraction system doesn't recognize these categories at write time, the agent can't retrieve by them at query time.
We found that explicitly instructing the extraction to look for strategic intelligence -- value propositions, competitive positioning, executive quotes, social proof, and buyer-relevant details -- consistently surfaced facts that generic extraction missed entirely. These are often the highest-value facts in any content, and they're the first things a generic extractor drops because they don't look like "facts" in the traditional sense. "We're building the operating system for modern manufacturing" is a vision statement, not a datapoint. But it's exactly what an agent needs to understand what this company believes about itself.
The Quality Trap: Why More Extraction Isn't Better
Here's where it gets counterintuitive. When we first added aggressive extraction rules, memory counts jumped dramatically from the same content. The dashboard looked great.
Then we analyzed the output. Hundreds of near-duplicate pairs -- the same fact stated slightly differently because of overlapping content sections. Memories that were pure noise: copyright notices, "has a careers page," generic contact phone numbers. Features extracted as ten individual memories ("offers a Tickets feature," "offers a Timesheets feature") when the useful fact was one consolidated memory: "features include tickets, timesheets, invoicing, and dispatch."
Two rules fixed it. Smart splitting: keep lists together unless each item has its own unique detail. "Features include A, B, C, and D" is one memory. "Feature A reduces downtime by 40%. Feature B saves $2M annually." is two memories. Noise filtering: skip copyright notices, navigation links, boilerplate, and generic contact information that no agent would ever use.
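Both rules are simple enough to sketch as write-time heuristics. The thresholds and regex patterns below are illustrative stand-ins, not the production values:

```python
import re

# Noise filtering: patterns no agent would ever act on. Illustrative.
NOISE_PATTERNS = [
    r"(?i)copyright|all rights reserved",   # boilerplate
    r"(?i)^(home|about|contact|careers)$",  # navigation links
    r"(?i)call us at|phone:",               # generic contact info
]

def is_noise(fact: str) -> bool:
    return any(re.search(p, fact) for p in NOISE_PATTERNS)

def split_list_fact(items: list[str]) -> list[str]:
    """Smart splitting: keep a list together unless each item carries
    its own unique detail (approximated here by item length)."""
    if all(len(item.split()) <= 3 for item in items):
        # One consolidated memory for a plain feature list.
        return ["features include " + ", ".join(items)]
    # One memory per item when each item has its own detail.
    return [f"offers {item}" for item in items]

facts = split_list_fact(["tickets", "timesheets", "invoicing", "dispatch"])
facts = [f for f in facts if not is_noise(f)]
print(facts)
```

A plain feature list collapses into one memory; a list where each item states its own metric ("reduces downtime by 40%") splits into one memory per item, exactly the distinction the rule above draws.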
Memory counts dropped significantly. Quality went up dramatically. Every remaining memory was actionable. Zero duplicates. Zero noise.
The lesson: extraction quality is not a monotonically increasing function of extraction volume. The metric that matters isn't how many facts you extracted. It's how many of those facts an agent would actually use. More memories in the store means more noise in the context window, which means worse agent performance.
The Uncomfortable Implication
Most teams building AI agents spend their optimization budget on retrieval. Better embeddings. Hybrid search. Re-ranking. Multi-step retrieval with tool use. These all operate on the same assumption: the content in the store is good enough, we just need to find the right pieces of it.
The evidence suggests the real leverage is upstream. Two layers upstream, in fact.
The first layer is the decision to extract structured memories instead of storing raw content. This is the architectural choice that determines whether your agent works with facts or fragments.
The second layer is the decision to guide that extraction with domain expertise, entity awareness, and strategic depth. This is the quality choice that determines whether those facts are actionable or superficial.
A perfect retrieval algorithm operating on shallow, noisy memories from a generic extraction pipeline will lose to a simple vector search operating on deep, clean memories from a guided extraction pipeline. The ceiling is set at write time. Everything downstream -- retrieval, re-ranking, agent reasoning -- can only work with what was stored.
Raw content is what was said. Extracted memory is what was meant. Guided extraction is what matters. The agents that get all three right will be the ones that actually work.
References
- Microsoft Research -- "GraphRAG: Unlocking LLM Discovery on Narrative Private Data" (2024): https://microsoft.github.io/graphrag/
- Packer et al. -- "MemGPT: Towards LLMs as Operating Systems" (2023): https://arxiv.org/abs/2310.08560
- Sarthi et al. -- "RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval" (2024): https://arxiv.org/abs/2401.18059
- Asai et al. -- "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection" (2024): https://arxiv.org/abs/2310.11511
- Sainz et al. -- "GoLLIE: Annotation Guidelines improve Zero-Shot Information Extraction" (EMNLP 2023): https://arxiv.org/abs/2310.03668
- Li et al. -- "Large Language Models for Generative Information Extraction: A Survey" (ACL 2024): https://arxiv.org/abs/2312.17617
- Wei et al. -- "Zero-Shot Information Extraction via Chatting with ChatGPT" (2023): https://arxiv.org/abs/2302.10205
- Kong et al. -- "Better Zero-Shot Reasoning with Role-Play Prompting" (2023): https://arxiv.org/abs/2308.07702
- Wang et al. -- "InstructUIE: Multi-task Instruction Tuning for Unified Information Extraction" (2023): https://arxiv.org/abs/2304.08085
- Zep AI -- "Stop Using RAG for Agent Memory" (2025): https://blog.getzep.com/stop-using-rag-for-agent-memory/
- Letta -- "RAG is not Agent Memory": https://www.letta.com/blog/rag-vs-agent-memory