Output quality saturates at roughly seven governed memories per entity. More context isn't better context — it's expensive noise.


TL;DR

  • In end-to-end evaluation, output quality plateaus at approximately seven governed memories per entity. Adding more doesn't improve results.
  • Progressive context delivery — injecting only new governance context at each step — cuts token usage by 50% with no quality loss.
  • The saturation curve validates a design principle: memory systems should optimize for signal density, not volume.
  • This has direct implications for context window management, cost, and the "lost in the middle" attention problem.

There's an assumption baked into most AI memory systems: more context is better. If the agent knows 5 things about a customer, knowing 50 must be 10x better. If you can fit 100 memories into the context window, why not?

We tested this assumption directly. The answer surprised us.

The Saturation Curve

Output quality vs. governed memories injected — steep rise to ~7 memories then flat plateau, with total +6.4 point gain over no-memory baseline

In our end-to-end evaluation — 10 prospects, 3 runs each, scored against a domain-specific rubric out of 100 — we measured output quality as a function of how many governed memories were available per entity.

The quality curve rises steeply from zero to about seven memories. Then it flattens. Additional memories beyond that threshold contribute negligible improvement to the output score. The system with the full governed memory pipeline scores 85.9/100, compared to 79.5/100 with no memory at all. That's a meaningful +6.4 point gain. But the gain concentrates in the first handful of memories.

Output quality saturates at approximately seven governed memories per entity.

This isn't about the total number of facts in the store. It's about how many the system selects and injects at inference time. A well-governed memory layer should have hundreds or thousands of facts per entity across months of interactions. The architecture's job is to surface the right seven.
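The idea of surfacing the right seven at inference time can be sketched in a few lines. This is a hypothetical illustration, not the system's actual API: the `Memory` type, the `relevance` score, and the threshold constant are all assumptions for the example.

```python
from dataclasses import dataclass

SATURATION_K = 7  # empirical plateau observed in the evaluation above


@dataclass
class Memory:
    fact: str
    relevance: float  # task-relevance score, however your system computes it


def select_for_injection(memories: list[Memory], k: int = SATURATION_K) -> list[Memory]:
    """Surface only the top-k most relevant memories; leave the rest in the store."""
    return sorted(memories, key=lambda m: m.relevance, reverse=True)[:k]
```

The store can hold thousands of `Memory` records per entity; only the top-k survivors of this selection ever reach the prompt.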

Why Seven?

The number isn't magical. It reflects a structural property of how LLMs attend to context.

Liu et al. demonstrated that models under-use information in the middle of long contexts — the "lost in the middle" effect. As context length grows, attention on any individual piece of information dilutes. The model has more raw material but processes each piece with less focus.

Seven memories hit the sweet spot: enough to cover the entity's current state, key history, and relevant constraints, without enough volume to trigger attention dilution. Beyond that, you're spending tokens to deliver information the model effectively ignores.

This maps to how humans work too. A sales rep walking into a meeting doesn't review every interaction the company has ever had with the prospect. They review the deal summary, the latest call notes, the key objection, the competitive landscape, and maybe one or two relationship notes. Seven things. If someone handed them a 50-page dossier, they'd skim it and miss the critical detail on page 37.

The Cost Implication

Progressive context delivery vs. naive re-injection across 5 agent steps — 13 units vs. 50 units, 50% token reduction with no quality loss

Context isn't free. Every token in the context window has a cost — financial (API billing) and computational (inference latency). If output quality saturates at seven memories but you're injecting thirty, you're paying 4x for the same result.

This gets worse in autonomous multi-step workflows. An agent that plans, acts, observes, and re-plans might invoke context injection at every step. Without session awareness, the same memories and governance policies get re-injected at each step.

We built progressive context delivery to address this. The system maintains a session state tracking which governance variables and memories have already been delivered. On each routing call, already-delivered content is excluded. Only new or newly relevant content is injected.

The result: 50% token reduction across multi-step workflows, with no quality degradation. The agent gets delta context — what's changed since last step — rather than a full dump every time.
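The session-state mechanism can be sketched as a simple delta filter. This is an illustrative sketch under the assumptions described above, not the production implementation; the keying scheme and data shapes are invented for the example.

```python
class ContextSession:
    """Tracks what has already been injected so each step receives only the delta."""

    def __init__(self) -> None:
        self.delivered: set[str] = set()  # ids of memories/policies already sent

    def delta(self, candidates: dict[str, str]) -> dict[str, str]:
        """Return only content not yet delivered this session, and record it."""
        fresh = {k: v for k, v in candidates.items() if k not in self.delivered}
        self.delivered.update(fresh)
        return fresh
```

On step one the agent receives everything; on step two, identical candidates produce an empty delta and only newly relevant content is injected.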

This is both a cost optimization and an accuracy optimization. Fewer redundant tokens means the model's attention stays concentrated on fresh, task-relevant information. The tokens you save aren't wasted capacity — they're noise you're removing.

What This Means for Architecture

The saturation finding has three architectural implications:

1. Memory selection matters more than memory volume.

The value of a memory system isn't measured by how much it stores. It's measured by how well it selects what to surface. A system with 1,000 memories per entity that surfaces the right 7 will outperform a system with 50 memories that dumps all of them into the context window.

This is where schema-enforced memory pays off. When memories have typed properties — job title, buying stage, pain points, budget range — the system can select based on task relevance, not just semantic similarity. "Preparing a renewal offer" should surface deal value, contract terms, and recent satisfaction signals, not the full history of every support ticket.
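Task-aware selection over typed properties might look like the following sketch. The property names and the task-to-property mapping are illustrative assumptions, not the system's actual schema.

```python
# Hypothetical mapping from task type to the typed properties it cares about.
TASK_RELEVANT_PROPERTIES: dict[str, set[str]] = {
    "renewal_offer": {"deal_value", "contract_terms", "satisfaction_signal"},
    "cold_outreach": {"job_title", "pain_points", "buying_stage"},
}


def select_by_task(memories: list[dict], task: str, k: int = 7) -> list[dict]:
    """Filter on typed properties first, so selection reflects task relevance
    rather than raw semantic similarity."""
    wanted = TASK_RELEVANT_PROPERTIES.get(task, set())
    matches = [m for m in memories if m.get("property") in wanted]
    return matches[:k]
```

The design choice here is that the schema does the coarse filtering; similarity search, if used at all, only ranks within the already-relevant subset.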

2. Quality gates at write time compound at read time.

If you don't filter noise before it enters the store, your memory selection has to work harder at retrieval time. Facts with unresolved coreferences ("he mentioned they're evaluating options" — who is "he"?), temporal ambiguity ("they recently switched" — when?), or low confidence dilute the pool that selection draws from.

We run three quality gates on every extraction batch: coreference scoring, self-containment scoring, and temporal anchoring scoring. These aren't expensive operations — they're lightweight heuristic checks. But they ensure the store contains clean, self-contained, temporally grounded facts. When the system selects seven memories from a clean store, they're good ones.
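The three gates can be approximated with lightweight heuristics like the sketch below. The actual scorers are not specified in this text; these regex checks are stand-ins to show the shape of a write-time gate.

```python
import re

# Stand-in heuristics; the real gates may use scoring models, not regexes.
UNRESOLVED_PRONOUNS = re.compile(r"\b(he|she|they|it)\b", re.IGNORECASE)
VAGUE_TIME = re.compile(r"\b(recently|soon|a while ago)\b", re.IGNORECASE)


def passes_gates(fact: str, entities: list[str]) -> bool:
    """Reject facts with unresolved coreferences, missing entity anchors,
    or vague temporal language before they enter the store."""
    if UNRESOLVED_PRONOUNS.search(fact):
        return False  # coreference gate: who is "he"?
    if not any(entity in fact for entity in entities):
        return False  # self-containment gate: fact must name its entity
    if VAGUE_TIME.search(fact):
        return False  # temporal anchoring gate: "recently" is not a date
    return True
```

Facts that fail a gate can be queued for rewriting rather than discarded, but the key point is that nothing ambiguous reaches the pool that selection draws from.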

3. Progressive delivery is not optional for multi-step agents.

Any agent that operates in loops — and most production agents do — needs session-aware context delivery. The alternative is re-injecting the same governance policies and entity context at every step, consuming budget that should go to task-specific reasoning. With seven memories and a five-step workflow, progressive delivery injects each memory once: seven injections total. Naive re-injection delivers the same seven at every step, 7 × 5 = 35 injections, 28 of them redundant. The math gets worse with governance variables, which tend to be longer than individual memory facts.


The Full Pipeline Effect

Four pipeline conditions — No memory (79.5), Raw memory (85.2), Open-set + governance (86.4), Full governed memory (85.9) — showing primary gain from memory, refinement from governance

Here's how the pieces fit in our end-to-end evaluation:

| Condition | Avg Score /100 |
|---|---|
| No memory | 79.5 |
| Raw memory (no governance) | 85.2 |
| Open-set memory + governance | 86.4 |
| Full governed memory | 85.9 |

Memory provides the primary quality gain (+5.7 points). Governance routing adds refinement (+1.2 points). The full governed pipeline (+6.4 points) scores comparably to the open-set condition because, for generation quality, open-set facts and governance context already provide the dominant signal.

But generation quality is only part of the story. Schema enforcement's primary value is realized downstream of generation: typed, validated property values enable reliable CRM synchronization, analytics aggregation, and structured API consumption that are orthogonal to whether the email sounded good.

The system that surfaces seven well-selected, quality-gated, governed memories consistently outperforms the system that dumps everything it knows into the prompt.

The Practical Takeaway

If you're building or evaluating a memory system for AI agents, test for saturation. Measure output quality as a function of injected context volume. If you find that quality plateaus — and you almost certainly will — design your architecture around selection quality, not storage volume.

The questions to ask:

  • How does the system decide which memories to surface? Semantic similarity alone, or task-aware selection?
  • Does the system track what's already been delivered in multi-step workflows?
  • Are there quality gates at write time, or does everything that gets extracted go into the store?
  • Can the system prioritize structured properties over raw observations when the task calls for it?

Seven isn't a magic number. It's an empirical signal that the bottleneck in agent memory is curation, not capacity.


Frequently Asked Questions

Does this mean I should only store seven memories per entity? No. Store everything. The saturation finding is about injection at inference time, not storage. A well-governed memory system should accumulate hundreds of facts per entity over time. The architecture's job is to select the right subset for each task.

What if my use case needs more context — like a legal review or a complex deal? The saturation point may shift depending on task complexity. Seven is the empirical finding for sales email generation. Longer-form tasks with more dimensions may saturate later. The principle holds: test for your saturation point and design around it.

How does this interact with larger context windows? Larger context windows don't change the attention problem. You can fit more tokens, but the model still attends less effectively to information in the middle. Saturation is about signal density, not window capacity.