Twelve memories. Five were names and titles. The rest were descriptions you could have copied from the homepage. That was the baseline.
We ran a straightforward test. We compiled a list of company websites and scraped their public content: about pages, product descriptions, customer stories, team bios. Then we fed it through our own Personize Memory, which had demonstrated the highest accuracy on standard memory benchmarks like LoCoMo, the same datasets the leading memory systems on the market use to measure their quality.
The system returned 12 memories. Five were people's names and job titles. The remaining seven were surface-level descriptions: "GreenPath Solutions is an energy efficiency company," "GreenPath Solutions offers consulting services," "GreenPath Solutions works with commercial buildings." Accurate. Technically correct. Top of the leaderboard. But not enough.
When I went back to the raw website content, the gap was obvious. I could see insights that would help AI agents serve leads at each of these companies: competitive positioning for writing proposals, buyer motivations for crafting emails, strategic priorities for planning outreach, organizational signals for better decision-making. The raw data was rich with material that would make agents meaningfully more relevant, more capable of reasoning about the company they were engaging with. But the metrics we were competing on, the same metrics every memory system optimizes for, assigned no value to any of this. The benchmarks rewarded extracting facts. They did not reward extracting intelligence.
Then we changed the extraction approach. Same content, same model, same pipeline. But with guided extraction -- entity-type awareness, strategic depth instructions, smart splitting rules. The output: 19 to 26 rich, actionable memories depending on the run. Products with positioning. Customer stories with metrics. Competitive differentiators. Executive perspectives. Value propositions with enough substance that an agent could write a compelling proposal paragraph from a single memory.
But the real story is not the count. Going from 12 to 26 is nice for a dashboard metric. What actually changed was what got extracted -- and that change happened across several distinct levels, each building on the last.
Level 1: Entity-Type Awareness
This is what most memory systems, including ours, were optimized for. And for good reason: the dominant use case for agent memory has been chatbots and conversational agents. Extract facts from conversation logs, sales call transcripts, CRM notes. Names, titles, and summary statements are exactly what that lens is designed to find. It was doing its job correctly. It was just doing the wrong job.
The problem is that we do not only serve chatbots. We support workflows, databases, and integrations where the data comes from varied sources, in different formats, describing different types of entities. A company website is not a conversation transcript. Extracting it with the same lens means the system never asks the question that matters: what kind of entity am I looking at?
When we told the extraction system "this is a Company record," everything shifted. The LLM was no longer looking for contact information and conversation summaries. It was looking for products, services, pricing, competitive positioning, customer proof, market strategy, and organizational structure. Same content. Completely different extraction because the model now knew what mattered.
The topic distribution tells the story clearly. Generic extraction produced memories clustered across 2 to 3 topic categories, with most falling into "other" -- the catch-all for facts that do not fit a recognized pattern. Guided extraction produced memories spanning 14+ distinct topics: product_feature, customer_story, team_member, pricing, market_position, value_proposition, methodology, industry_focus, partnership, competitive_differentiator, and more.
This is not a minor improvement. It is a categorical change in what the system considers worth remembering. And it required exactly one piece of additional context: the entity type.
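Mechanically, the change can be as small as prepending the entity type and its priority list to the extraction prompt. A minimal sketch -- the focus table, wording, and function name are illustrative, not our production schema:

```python
# Entity-type-aware prompt construction. The EXTRACTION_FOCUS table and
# prompt wording are illustrative examples, not a real production schema.

EXTRACTION_FOCUS = {
    "company": (
        "products, services, pricing, competitive positioning, "
        "customer proof, market strategy, organizational structure"
    ),
    "contact": (
        "role, responsibilities, priorities, relationships, "
        "communication preferences"
    ),
}

def build_extraction_prompt(content: str, entity_type: str) -> str:
    """Prepend the entity type and what matters for it to the raw content."""
    focus = EXTRACTION_FOCUS.get(entity_type.lower(), "any notable facts")
    return (
        f"You are extracting memories from a {entity_type} record.\n"
        f"Prioritize: {focus}.\n\n"
        f"Content:\n{content}"
    )

prompt = build_extraction_prompt(
    "GreenPath Solutions is a design-build energy efficiency firm.", "Company"
)
```

The only new input is the entity type; the content, model, and pipeline stay the same.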
Level 2: Strategic Depth
Entity awareness gets you the right categories. Strategic depth gets you useful content within those categories.
Consider two memories extracted from the same source material:
"GreenPath Solutions has a product called BuildIQ."
"GreenPath Solutions' BuildIQ platform connects building operations teams with real-time energy data for proactive maintenance, continuous optimization, and reduced operational costs."
Both are true. Both are accurate extractions from the same content. Only the second is useful for an agent writing a proposal, crafting an outreach email, or preparing competitive analysis. The first is a label. The second is intelligence.
The difference comes from teaching the extraction system what "useful" means in context. For company records, we instruct the model to look for: value propositions with specific benefits, competitive positioning relative to alternatives, executive quotes that reveal strategic priorities, customer testimonials that include measurable outcomes, and methodology descriptions that signal how the company operates.
These are the facts that make downstream agent work compelling rather than generic. An agent with the first memory writes: "GreenPath Solutions offers BuildIQ." An agent with the second memory writes: "Their BuildIQ platform gives your operations team real-time visibility into energy performance -- proactive maintenance instead of reactive firefighting." Same source. Radically different output quality.
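A crude way to see the label-versus-intelligence gap programmatically is a substance check on extracted memories. Real quality scoring would use an LLM judge; this keyword-and-length heuristic is only a sketch, but it separates the two BuildIQ memories above:

```python
import re

# Crude heuristic: a memory is "label-only" if it is short AND carries no
# benefit or outcome language. The keyword list is illustrative.
SUBSTANCE_HINTS = re.compile(
    r"reduc|improv|enabl|connect|deliver|saving|real-time|proactive|\d+%",
    re.IGNORECASE,
)

def is_label_only(memory: str, min_words: int = 10) -> bool:
    """True if a memory looks like a bare label rather than usable substance."""
    return len(memory.split()) < min_words and not SUBSTANCE_HINTS.search(memory)

print(is_label_only("GreenPath Solutions has a product called BuildIQ."))  # True
print(is_label_only(
    "GreenPath Solutions' BuildIQ platform connects building operations "
    "teams with real-time energy data for proactive maintenance and "
    "reduced operational costs."
))  # False
```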
Level 3: Smart Splitting (The Quality Trap)
Here is where the story gets counterintuitive.
After implementing entity awareness and strategic depth, we ran the extraction again with aggressive splitting -- break every compound statement into individual atomic facts. For one company, the memory count jumped from 12 to 101. On a dashboard, this looked like a massive improvement. Five times the coverage. Rich topic diversity. Every fact isolated and independently retrievable.
Then we analyzed the output. 149 near-duplicate pairs. Fourteen junk memories -- copyright notices, "GreenPath Solutions has a careers page," navigation labels stored as facts. Ten product features that had been over-atomized into individual memories when a single consolidated memory capturing the feature set would have been far more useful.
The system had learned to extract more. It had not learned to extract better.
Two rules fixed it. First, smart splitting: keep lists and feature sets together unless individual items carry unique detail worth preserving independently. "BuildIQ provides real-time monitoring, predictive maintenance, energy benchmarking, and automated reporting" is one useful memory. Splitting it into four memories -- one per feature -- just creates four fragments that are less useful individually than the whole.
Second, noise filtering: skip boilerplate. Copyright notices, navigation elements, "click here to learn more," career page mentions, cookie policy references. These are not facts worth storing.
After applying both rules, the count dropped to 18 to 22 memories. Every single one was actionable. The junk was gone. The duplicates were gone. The over-atomized fragments were consolidated.
The lesson is important: extraction quality is not monotonically increasing with extraction volume. More memories can mean worse performance if the additional memories are noise that pollutes the context window at retrieval time. The right metric is not memory count. It is downstream task utility.
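Noise filtering and duplicate collapse can be sketched as a post-extraction cleaning pass (smart splitting itself lives in the prompt instructions). The boilerplate patterns and the 0.8 Jaccard threshold below are illustrative choices, not our production values:

```python
import re

# Drop boilerplate, then keep only the first of each near-duplicate group.
# Patterns and threshold are illustrative, not production values.
BOILERPLATE = re.compile(
    r"copyright|all rights reserved|careers? page|cookie policy|"
    r"click here|privacy policy",
    re.IGNORECASE,
)

def tokens(s: str) -> set[str]:
    return set(re.findall(r"[\w-]+", s.lower()))

def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

def clean(memories: list[str], dup_threshold: float = 0.8) -> list[str]:
    kept: list[str] = []
    for m in memories:
        if BOILERPLATE.search(m):
            continue  # noise: not a fact worth storing
        if any(jaccard(m, k) >= dup_threshold for k in kept):
            continue  # near-duplicate of something already kept
        kept.append(m)
    return kept

memories = [
    "BuildIQ provides real-time monitoring and predictive maintenance.",
    "BuildIQ provides real-time monitoring and predictive maintenance tools.",
    "© 2024 GreenPath Solutions. All rights reserved.",
    "GreenPath Solutions has a careers page.",
]
print(clean(memories))  # only the first memory survives
```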
Level 4: Implied Context
This is the subtlest level, and the one that separates good extraction from genuinely intelligent extraction.
The source content describes GreenPath Solutions as a "design-build energy efficiency firm." A human analyst reading that immediately understands something that is not explicitly stated: this company does both consulting AND implementation. That is a competitive differentiator versus pure consulting firms that hand off a report and walk away. It signals end-to-end ownership, accountability for outcomes, and a different sales conversation than "we will advise you."
Generic extraction stores the phrase. Guided extraction stores the implication: "GreenPath Solutions differentiates through a design-build model -- they handle both the consulting/design phase and the physical implementation, providing end-to-end accountability that pure consulting competitors cannot match."
Other examples of implied context that guided extraction captures:
Event attendee descriptions like "industry leaders, systems engineers, and facility managers" reveal the company's market positioning tier. They are not selling to small businesses. They are operating in enterprise and institutional markets.
Customer segment language like "property management companies seeking to reduce operating costs and meet ESG targets" reveals buyer motivation. The company's customers are not buying energy efficiency for environmental reasons alone -- they are buying cost reduction with ESG compliance as a secondary benefit. That changes how an agent should position any outreach.
Methodology language like "data-driven decision making and measurable outcomes" signals a consultative selling approach. This company leads with analytics, not relationships. An agent preparing a pitch should emphasize ROI projections and measurement frameworks, not personal rapport.
These implied signals are the highest-value extractions for agents doing personalized outreach, competitive analysis, or proposal writing. They are also the extractions that generic systems miss entirely, because the explicit text never states them directly.
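In our pipeline this behavior comes from prompt instructions rather than post-processing. A sketch of what such an instruction block might look like -- the wording is illustrative, not our production prompt:

```python
# Illustrative "implied context" instruction block appended to the
# extraction prompt. The wording is a sketch, not a production prompt.
IMPLIED_CONTEXT_RULES = """\
When the text implies something strategically important, store the
implication, not just the phrase. Signals to interpret:
- business-model descriptors ("design-build") -> competitive differentiators
- audience descriptions ("systems engineers, facility managers") -> market tier
- buyer language ("reduce operating costs, meet ESG targets") -> motivation
- methodology language ("data-driven, measurable outcomes") -> selling style
State each implication as a standalone memory an agent can act on.
"""

def with_implied_context(base_prompt: str) -> str:
    return base_prompt + "\n" + IMPLIED_CONTEXT_RULES

prompt = with_implied_context("Extract memories from this Company record:")
```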
Level 5: Cross-Entity Intelligence
Content stored against a Contact record often contains far more than contact facts. A sales rep's research notes about a prospect include company analysis, product details, deal context, competitive landscape, and market observations. A strictly typed extraction -- "this is a Contact record, only extract Contact facts" -- throws away everything except the person's name, title, and a few personal details.
That is a massive waste. The company research in those notes is valuable. The product analysis is valuable. The competitive observations are valuable. They just happen to live in a record associated with a person rather than a company.
The fix: treat entity type as context for prioritization, not a filter. A Contact record that contains company research should absolutely produce company memories. They describe the contact's world -- the organization they work in, the products they use, the challenges they face. An agent preparing for a meeting with that contact needs all of it, not just their job title.
Cross-entity extraction captures every valuable fact regardless of which record type it is attached to. The entity type tells the system what to prioritize, not what to ignore.
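One way to express "context for prioritization, not a filter" in code: tag each memory with the entity it is about, keep all of them, and let the record type influence only ranking. The Memory shape and priority scheme here are illustrative:

```python
from dataclasses import dataclass

# Illustrative memory shape: each fact is tagged with the entity it is
# *about*, independently of the record it was extracted from.
@dataclass
class Memory:
    text: str
    subject_type: str  # "contact", "company", ...

def prioritize(memories: list[Memory], record_type: str) -> list[Memory]:
    """Keep every memory; rank those matching the record type first."""
    return sorted(memories, key=lambda m: m.subject_type != record_type)

extracted = [
    Memory("Competitor X just entered their region.", "company"),
    Memory("Jane Rivera is VP of Operations.", "contact"),
    Memory("They are evaluating ESG reporting tools.", "company"),
]
ranked = prioritize(extracted, record_type="contact")
print([m.subject_type for m in ranked])  # contact facts first, nothing dropped
```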
Level 6: What Existing Benchmarks Miss
Current evaluation frameworks for memory systems -- LongMemEval, LoCoMo, and similar benchmarks -- measure whether a system can recall facts correctly. Given a set of conversations, can the system answer questions about what was said? This is necessary but nowhere near sufficient.
These benchmarks do not measure:
Actionability. Can an agent USE this memory to complete a task? A memory that says "GreenPath Solutions exists" recalls correctly but enables nothing. A memory that captures their value proposition, target market, and competitive positioning enables an agent to write a personalized proposal paragraph. Both score the same on recall benchmarks.
Richness. Does the memory capture substance or just a label? "Has a product called BuildIQ" and "BuildIQ connects operations teams with real-time energy data for proactive maintenance and cost reduction" are both accurate extractions. The second is dramatically more useful. No current benchmark distinguishes them.
Noise ratio. What fraction of extracted memories are junk? A system that extracts 100 memories including 15 copyright notices and navigation labels looks great on volume metrics. At retrieval time, those 15 noise entries compete for context window space with the 85 useful memories.
Topic diversity. Does extraction cover multiple aspects of the entity or cluster on the obvious ones? Twelve memories that are all variations of "GreenPath Solutions is an energy efficiency company" score well on recall but provide almost no breadth.
Downstream task quality. Do better memories produce better agent outputs? This is the metric that actually matters, and no standard benchmark measures it. A system that extracts 100 memories and scores 95% recall may perform worse in practice than one that extracts 20 memories -- if the 100 include 40 duplicates and 15 noise entries that pollute the context window and degrade the agent's reasoning.
The gap between what benchmarks measure and what production systems need is significant. Recall accuracy is table stakes. The real differentiator is extraction intelligence -- knowing what is worth remembering, at what granularity, with what implied context.
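Two of these gaps are easy to measure once memories carry topic labels and a noise judgment. Here both labels are assumed to come from an upstream judge, so the inputs below are illustrative:

```python
# Two evaluation signals current benchmarks skip: noise ratio and topic
# diversity. Assumes each memory carries a topic label and a noise flag
# from an upstream judge; the inputs below are illustrative.

def noise_ratio(memories: list[dict]) -> float:
    return sum(m["is_noise"] for m in memories) / len(memories)

def topic_diversity(memories: list[dict]) -> int:
    return len({m["topic"] for m in memories if not m["is_noise"]})

memories = [
    {"topic": "product_feature", "is_noise": False},
    {"topic": "customer_story", "is_noise": False},
    {"topic": "pricing", "is_noise": False},
    {"topic": "other", "is_noise": True},  # navigation label stored as a fact
]
print(noise_ratio(memories))      # 0.25
print(topic_diversity(memories))  # 3
```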
Where This Leads
The industry is converging on memory extraction as a critical capability for agent systems. GraphRAG demonstrated that structured knowledge representations outperform raw chunk retrieval. MemGPT showed that memory management is as important as memory storage. RAPTOR proved that hierarchical summarization improves retrieval across abstraction levels. Self-RAG established that retrieval quality matters more than retrieval quantity.
On the extraction side, work on structured information extraction -- GoLLIE, InstructUIE, and related systems -- shows that guided extraction with explicit schemas dramatically outperforms open-ended extraction. The pattern holds across domains: tell the model what to look for and it finds more of it, with higher quality.
But the conversation in the agent memory community is still largely stuck on "did we extract the fact?" when the real question is "did we extract something an agent can act on?" The levels described above -- entity awareness, strategic depth, smart splitting, implied context, cross-entity intelligence -- are the difference between a memory system that stores text and one that understands what it is storing.
The benchmarks will catch up. They always do. In the meantime, if you are building agent memory, measure what matters: not how many facts you extracted, but what an agent can do with them.
References
- GraphRAG -- Microsoft Research (2024). "From Local to Global: A Graph RAG Approach to Query-Focused Summarization." Structured knowledge graph construction from source documents for improved retrieval.
- MemGPT -- Packer et al. (2024). "MemGPT: Towards LLMs as Operating Systems." Virtual context management with tiered memory for long-term agent interactions.
- RAPTOR -- Sarthi et al. (2024). "RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval." Hierarchical clustering and summarization for multi-level retrieval.
- Self-RAG -- Asai et al. (2024). "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection." Adaptive retrieval with self-assessment of generation quality.
- GoLLIE -- Sainz et al. (2024). "GoLLIE: Annotation Guidelines improve Zero-Shot Information-Extraction." Code-based schema definitions for guided extraction with LLMs.
- InstructUIE -- Wang et al. (2023). "InstructUIE: Multi-task Instruction Tuning for Unified Information Extraction." Instruction-tuned extraction across entity, relation, and event types.
- Zep -- "Building Long-Term Memory for AI Agents" (2024). Production perspectives on memory extraction pipelines and knowledge graph construction.
- Letta -- "The Future of AI Agent Memory" (2025). Tiered memory architectures and the case for structured memory over raw storage.