Transparent breakdown of our experimental results: what we tested, what the numbers prove, what they don't, and why we benchmark against ourselves honestly.
TL;DR
- 99.6% fact recall is on synthetic data — it's an upper-bound demonstration, not a claim about arbitrary real-world inputs.
- 74.8% on LoCoMo is an external benchmark confirming that governance and schema enforcement impose no retrieval quality penalty. On open-ended inference, we exceed human-level (83.6% vs. 75.4%).
- Independently evaluated memory systems report 42–67% on comparable settings. But methodologies vary — direct comparison across systems is unreliable.
- We designed experiments as repeatable monitoring templates, not one-shot benchmarks. Small controlled datasets are intentional.
When you publish numbers like 99.6% and 74.8%, you owe people context. What was tested. What conditions. What the numbers actually prove versus what they don't. This post is that context.
The Internal Experiments
We ran controlled experiments across 250 synthetic samples spanning five content types: call notes, documents, emails, transcripts, and chat logs. Synthetic data with embedded ground truth — known fact counts, known property values, known coreference issues, known near-duplicates.
Extraction Quality (E1)
| Content Type | Samples | Fact Recall |
|---|---|---|
| Call notes | 50 | 100% |
| Documents | 50 | 100% |
| Emails | 50 | 100% |
| Transcripts | 50 | 100% |
| Chats | 50 | 98% |
| Overall | 250 | 99.6% |
99.6% fact recall is consistently high regardless of content format, but these are synthetic datasets. The samples are structurally diverse, stressing distinct extraction challenges such as coreference resolution, temporal reasoning, and implicit facts, yet they're free of the noise, formatting inconsistencies, and ambiguity typical of production data.
The honest interpretation: the extraction architecture and algorithm reliably capture ground-truth facts under controlled, diverse conditions. This is an upper-bound demonstration of the pipeline's capability. Production deployments exhibit comparable but modestly lower recall, consistent with the additional noise in organic content.
We say this in the paper. It matters to say it here too.
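For concreteness, fact recall here is the standard set-overlap metric over labeled ground truth. A minimal sketch of the computation (function and data shapes are illustrative, not the production harness):

```python
def fact_recall(extracted: set[str], ground_truth: set[str]) -> float:
    """Fraction of ground-truth facts the pipeline extracted."""
    if not ground_truth:
        return 1.0
    return len(extracted & ground_truth) / len(ground_truth)

def overall_recall(samples: list[tuple[set[str], set[str]]]) -> float:
    """Micro-averaged recall across (extracted, ground_truth) pairs."""
    hits = sum(len(e & g) for e, g in samples)
    total = sum(len(g) for _, g in samples)
    return hits / total if total else 1.0
```

Embedding ground truth in the synthetic samples is what makes this denominator knowable; on unlabeled production data the same number can only be estimated.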
Quality Gates (E9)
On 40 samples, retrieval with quality gates reduces the output defect rate by 25% relative to raw retrieval (6.3% vs. 8.4%). Temporal accuracy improves by 6.8 percentage points, and the signal-to-noise ratio increases from 1.1:1 to 4.2:1.
These gains come from filtering coreference-unresolved, non-self-contained, and temporally ambiguous facts before they enter the store. Downstream retrieval inherits cleaner signal. Decision precision at 94.5% validates the gate as a reliable guard rather than a coarse filter.
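The gate itself is a conjunction of per-fact checks. A minimal sketch of the admission logic, where the `Fact` fields are illustrative stand-ins for the real checks:

```python
from dataclasses import dataclass

@dataclass
class Fact:
    text: str
    coref_resolved: bool       # no dangling pronouns
    self_contained: bool       # interpretable without surrounding context
    temporally_anchored: bool  # carries an explicit or inferable time reference

def passes_gate(fact: Fact) -> bool:
    # A fact enters the store only if every check passes;
    # downstream retrieval then inherits the cleaner signal.
    return fact.coref_resolved and fact.self_contained and fact.temporally_anchored

facts = [
    Fact("Acme moved its primary region to eu-west-1 in March 2024", True, True, True),
    Fact("He approved it last week", False, False, False),  # unresolved, ambiguous
]
admitted = [f for f in facts if passes_gate(f)]
```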
Dual Memory Complementarity (E12)
| Category | % of Total |
|---|---|
| Captured by both modalities | 34% |
| Open-set only (long-tail insights) | 38% |
| Schema-enforced only (typed values) | 12% |
| Missed by both | 16% |
| Combined recall | 82.8% |
The 38% captured exclusively by open-set memory — relational facts, qualitative observations, contextual details — would be permanently lost in a schema-only system. The 12% captured exclusively by schema enforcement would lack type validation in an open-set-only system.
16% is missed by both. We're transparent about that. Some information in unstructured content resists extraction regardless of modality.
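The table's categories fall out of two boolean capture flags per ground-truth fact. A sketch of how such a breakdown and combined recall can be computed (illustrative, not the paper's exact harness):

```python
def complementarity(open_hits: list[bool], schema_hits: list[bool]) -> dict[str, float]:
    """Per-fact capture flags for each modality -> category fractions."""
    n = len(open_hits)
    both = sum(o and s for o, s in zip(open_hits, schema_hits))
    open_only = sum(o and not s for o, s in zip(open_hits, schema_hits))
    schema_only = sum(s and not o for o, s in zip(open_hits, schema_hits))
    missed = n - both - open_only - schema_only
    return {
        "both": both / n,
        "open_only": open_only / n,
        "schema_only": schema_only / n,
        "missed": missed / n,
        # Combined recall = union of what either modality captured.
        "combined_recall": (both + open_only + schema_only) / n,
    }
```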
Governance Routing (E3, E13)
92% precision, 88% recall across 20 diverse task types and 25 governance variables. The finding that mattered most operationally: well-authored governance variables are 20–50 percentage points more discoverable than poorly authored ones. In 3 of 5 categories, poorly authored variables scored a 0% discovery rate. Governance quality is an authoring problem as much as a routing problem.
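Routing precision and recall here are the standard definitions applied to routing decisions: of the governance variables the router selected, how many were relevant (precision), and of the relevant ones, how many it selected (recall). A minimal sketch:

```python
def routing_metrics(selected: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision and recall of a router's variable selection for one task."""
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 1.0
    recall = true_positives / len(relevant) if relevant else 1.0
    return precision, recall
```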
Reflection-Bounded Retrieval (E10)
| Condition | Avg Completeness |
|---|---|
| No reflection (baseline) | 37.1% |
| API-managed, 1 round | 40.4% |
| Manual multi-hop, 1 round | 61.2% |
| Manual multi-hop, 2 rounds | 62.8% |
+25.7 percentage points on hard multi-faceted queries. But the insight is in the breakdown: query generation strategy is the key determinant, not round count. Manual multi-hop dramatically outperforms API-managed reflection. One round gets most of the way to the two-round ceiling. We're actively working on improving API query generation to close this gap.
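The difference between the API-managed and manual conditions is where sub-queries come from. A hedged sketch of the manual multi-hop loop, assuming a `retrieve(query)` callable that returns a set of facts (all names here are illustrative):

```python
def manual_multi_hop(facets: list[str], retrieve, rounds: int = 1) -> set[str]:
    """One targeted sub-query per facet of the question, repeated per round."""
    gathered: set[str] = set()
    for _ in range(rounds):
        for facet in facets:
            gathered |= retrieve(facet)  # one focused query per facet
    return gathered

def completeness(required: set[str], retrieved: set[str]) -> float:
    """Fraction of the facts the query needs that were actually retrieved."""
    return len(required & retrieved) / len(required) if required else 1.0
```

The E10 breakdown suggests the leverage is in `facets`, the query decomposition, not in `rounds`.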
Entity Isolation (E11)
Zero true cross-entity leakage across 3,800 results under adversarial conditions. I wrote about this in detail in Zero Cross-Entity Leakage Across 3,800 Results.
Adversarial Governance (E15)
100% compliance across 50 adversarial scenarios designed to bypass governance constraints, and zero organizational policy leakage. The guardrail activation rate was 96%: 2 easy-category inputs resolved correctly without explicit guardrail invocation, which is acceptable behavior (the system reached the right answer through a different path).
Semantic Conflict Resolution (E14)
When the same entity accumulates contradictory facts over time — a company changes its primary database, a contact switches roles — the system must surface the most recent claim. Across 30 conflict pairs:
- 83.3% conflict detection (fresh information surfaced in the answer)
- 33.3% strict recency accuracy (answer contains only fresh keywords)
- 3.3% incorrect stale (1 out of 30)
The gap between 83.3% and 33.3% is instructive. In 15 of the 30 cases, the answer leads with the fresh claim but references the stale value as transition context — "migrated from AWS to Google Cloud." That's actually good behavior for many use cases. The agent isn't confused; it's providing the trajectory. Whether that's desirable depends on the task.
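Surfacing the fresh claim is, at its core, a max-by-timestamp over contradictory facts for the same attribute. A minimal sketch (the claim representation is illustrative):

```python
from datetime import date

def freshest_claim(claims: list[tuple[date, str]]) -> str:
    """Return the most recently asserted value for a conflicting attribute."""
    return max(claims, key=lambda claim: claim[0])[1]

history = [
    (date(2023, 1, 5), "primary database: AWS RDS"),
    (date(2024, 6, 2), "primary database: Google Cloud SQL"),
]
```

A strict-recency scorer then requires the answer to contain only the fresh value's keywords, while conflict detection only requires the fresh value to be present, which is exactly the gap the transition-context answers fall into.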
The External Benchmark: LoCoMo
Internal experiments on synthetic data prove the architecture works under controlled conditions. They don't prove it works as a general-purpose memory system. For that, we went to LoCoMo.
LoCoMo is an independent benchmark for long-term conversational memory: 272 sessions, 1,542 questions across 10 conversations. It tests single-hop retrieval, multi-hop reasoning, temporal updates, and open-ended inference.
| Category | Our Accuracy | vs. Human Baseline |
|---|---|---|
| Single-hop | 78.7% | −16.4pp |
| Multi-hop | 51.7% | −34.1pp |
| Temporal | 64.6% | −28.0pp |
| Open-ended | 83.6% | +8.2pp |
| Overall | 74.8% | −13.1pp |
Three things matter here:
1. Governance imposes no retrieval penalty. This is the core architectural question the benchmark answers. We added schema enforcement, organizational governance routing, entity-scoped isolation, and progressive context delivery on top of a memory system. Did all that machinery degrade retrieval quality? No. 74.8% is state-of-the-art among the enterprise memory systems we've seen evaluated.
2. Open-ended inference exceeds human-level. 83.6% vs. 75.4% on the largest category (841 questions). This is where retrieval stops and memory starts — synthesizing facts that were never explicitly stated from accumulated context. It makes sense: the system has perfect recall of stored facts and can synthesize across them without the attention limitations of a human reviewing a long conversation.
3. Multi-hop and temporal are active gaps. 51.7% on multi-hop and 64.6% on temporal. These are the hard categories and they remain optimization areas. We're not hiding that.
The Comparison Problem
For context, independently evaluated memory systems report: Mem0 at 64–67%, Zep at 42–66%, OpenAI built-in memory at ~53%. Our 74.8% is the highest we've seen.
But I want to be careful here. Published systems use varying methodologies — pure token-overlap F1, pure LLM-as-judge, or hybrid approaches. We use a hybrid text-match-first / LLM-judge-fallback methodology; 950 of 1,153 correct answers (82.4%) scored via text-match. These methodological differences mean scores are not directly comparable across systems.
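The hybrid scorer is cheap-first: a normalized text match, with the LLM judge invoked only on misses. A sketch under that description, where `llm_judge` is a stand-in for the actual judge call:

```python
def score_answer(answer: str, gold: str, llm_judge) -> bool:
    # Stage 1: normalized substring match handles the bulk of cases
    # (82.4% of correct answers in our run) without an LLM call.
    if gold.strip().lower() in answer.strip().lower():
        return True
    # Stage 2: fall back to an LLM judge for paraphrases and synthesis.
    return llm_judge(answer, gold)
```

This ordering is also why text-match-heavy scoring is hard to compare against pure LLM-as-judge pipelines: the two stages have different failure modes.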
The honest framing: we outperform published baselines on the same benchmark, but the comparison is approximate, not exact.
Why Small Datasets Are Intentional
A common reaction: "250 samples? That's tiny." It's intentional.
These experiments serve a dual purpose: validating architectural mechanisms and defining a repeatable evaluation methodology that organizations can apply as ongoing operational monitoring. Each experiment targets a specific governance concern and maps directly to a production health signal.
We validated sample sizes by scaling: experiments began at N=10 per content type and were incrementally increased to N=50. Key metrics (fact recall, routing precision, entity isolation) stabilized with minimal variance beyond N=30 per type. N=250 is sufficient for the effects being measured.
The metrics we introduce — governance routing precision and recall, schema discovery rate, context defect rate, memory density curves — are proposed as standard instrumentation for production memory systems. Not one-time benchmark scores, but monitoring templates designed for fast, interpretable re-execution as schemas, content types, and underlying models evolve.
What We're Measuring Next
The gaps are clear:
- Multi-hop retrieval (51.7%) needs better query generation strategy — the E10 results show this is a query decomposition problem, not a retrieval problem
- Temporal reasoning (64.6%) needs improved conflict resolution beyond recency decay
- Cross-organization validation — do these patterns hold across varying content types and schema maturity levels?
- Production data evaluation — controlled synthetic experiments give upper bounds; production noise gives reality
We'll publish updated numbers as these improve. The evaluation pipeline is continuous, not a one-shot publication exercise.
Frequently Asked Questions
Why synthetic data instead of real production data? Synthetic data with embedded ground truth enables reproducible measurement with known fact counts, property values, and edge cases. You can't measure "fact recall" without knowing the ground truth. Production data is used for qualitative validation, but you can't compute precise recall without labels.
How do you handle the LLM-as-judge problem? Three mitigations: rubric-first prompting (the rubric is injected before the content, not after), trace-grounded evaluation (the judge has access to the full execution trace including tool calls and memory operations), and configurable cross-model evaluation (use a different model family for judging than for generation).
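The first two mitigations are largely prompt-assembly order: rubric before content, and the full trace alongside the answer. A minimal sketch of such a judge prompt builder (the section labels are illustrative):

```python
def build_judge_prompt(rubric: str, trace: str, answer: str) -> str:
    # Rubric-first: the judge reads the grading criteria before any
    # content it might anchor on; the trace grounds the evaluation
    # in what the system actually did, not just the final text.
    return "\n\n".join([
        "## Grading rubric\n" + rubric,
        "## Execution trace (tool calls, memory operations)\n" + trace,
        "## Candidate answer\n" + answer,
        "Grade the candidate answer strictly against the rubric above.",
    ])
```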
Are these results reproducible? All experiments use a fixed random seed (42) and were executed against the production API. The paper includes full dataset descriptions and algorithm specifications. We're releasing datasets and extended supplementary material.