An agent needed to know how many contacts were in Victoria. It called our semantic search endpoint, got back 47 text chunks mentioning Victoria, and hallucinated a count from the content. The actual number was 51.
That was the moment I realized the problem wasn't our memory system. The memory system was fine. The problem was that we had ten retrieval endpoints and every agent had to figure out which one to call. A counting question hit the semantic search path because that was the only path the agent knew about. The right answer required a property filter with a count aggregation. Two completely different operations.
We had built memory_search, memory_digest, memory_recall_pro, memory_similarity, memory_segment, memory_compare, memory_browse, memory_count, memory_aggregate, and memory_analyze. Ten endpoints, each doing something genuinely different. And every time we onboarded a new agent or a new customer integration, we had to teach the caller which endpoint handled which question.
That's the story of why we built SmartRecall Unified -- a single natural-language endpoint that replaces all ten. The caller describes what they need in plain English. The system classifies the intent, builds an execution plan, runs it across the right data sources, and returns a structured response with token budgeting.
It sounds simple. It took five rewrites of the classifier to get it working.
The surface area problem
Each of those ten endpoints existed for a good reason. Semantic search and property filtering are fundamentally different operations. Counting records and comparing two entities side-by-side have nothing in common except that they both read from memory. A digest that compiles full context on a person requires assembling data from multiple stores -- vector search for memories, DynamoDB for structured properties, identity resolution for alias merging. A filter query doesn't need any of that.
The problem wasn't that we had built too many endpoints. The problem was that the agent calling the system shouldn't need to know which retrieval strategy to use. That's an infrastructure concern. When a sales rep asks "how many leads came from the webinar?" they're not thinking about whether that's a count query on a property filter or a semantic search with aggregation. They're asking a question.
We were leaking implementation details into the interface.
The first classifier (53% accuracy)
The first version was pure regex. Pattern-match the query, route to the right backend. "How many" goes to count. "Compare" goes to compare. "List all" goes to browse.
It worked for the obvious cases and broke on everything else. "Who are my top prospects?" -- is that a filter, a segment, or a similarity search? "Tell me about the Acme deal" -- is that a recall, a digest, or a search? "Contacts like Sarah but in enterprise" -- similarity with a filter condition? The regex couldn't tell.
53% accuracy on our benchmark set -- worse than simply guessing the most common intent every time, once you account for the skewed query distribution.
The second and third attempts
We added embedding-based classification. Encode the query, compare against exemplar embeddings for each intent, pick the closest match. This got us to 64%.
The problem was intent overlap. "Find contacts similar to our best customer" and "rank contacts by ICP fit" are semantically close but require completely different execution strategies -- similarity search vs. segment scoring. Embeddings encode meaning, not operational intent.
We merged some intents. Recall and digest were causing the most confusion because "tell me about John" and "full context on John" are the same question at different depth levels. Once we treated digest as recall-with-full-depth instead of a separate intent, accuracy jumped to 72%.
But 72% means more than one in four queries hits the wrong backend. For a system that agents depend on, that's not close to good enough.
What actually worked: a three-layer hybrid
The fifth rewrite landed on a three-layer classifier that handles queries in tiers:
Layer 1: Rule overrides (0ms). About 60 regex patterns that catch unambiguous queries with zero false positives. "How many" is always a count. "Compare X vs Y" is always a compare. "List all contacts" is always a browse. This layer handles 65% of all queries instantly, with no API call.
Layer 2: Embedding similarity (50-100ms). 245 curated exemplar embeddings across 9 intents. But pure cosine similarity wasn't enough -- the overlap problem remained. The fix was a hybrid scoring function: 55% cosine similarity, 45% keyword overlap. The keyword signal disambiguates cases where embeddings are too close. "Rank by engagement" and "find similar contacts" have similar embeddings but very different keyword signatures.
Layer 3: LLM fallback (500-2000ms). Only triggered when the top two intents from Layer 2 are within a margin of 0.03. A fast LLM call with few-shot examples breaks the tie. In practice, this fires on less than 5% of queries.
The result on 335 benchmark queries across 20 categories:
| Layer | Coverage | Accuracy | Latency |
|-------|----------|----------|---------|
| Rules | 65% of queries | 100% | 0ms |
| Embeddings | ~30% of queries | 96% | 50-100ms |
| LLM fallback | ~5% of queries | 91% | 500-2000ms |
| Overall | 100% | 97.9% | 168ms avg |
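The tiered dispatch described above can be sketched roughly like this. The intent names match the post; the rule patterns, the `Scored` shape, and the `embedScore`/`llmTieBreak` helper signatures are illustrative assumptions, not the production API:

```typescript
// Hypothetical sketch of the three-layer classifier dispatch.
type Intent =
  | "count" | "compare" | "browse" | "filter" | "recall"
  | "similarity" | "segment" | "search" | "analyze";

interface Scored { intent: Intent; score: number; }

// Layer 1: zero-false-positive regex rules (three of the ~60, assumed).
const RULES: Array<[RegExp, Intent]> = [
  [/^how many\b/i, "count"],
  [/\bcompare\b.+\bvs\.?\b/i, "compare"],
  [/^list all\b/i, "browse"],
];

function classify(
  query: string,
  embedScore: (q: string) => Scored[],                // Layer 2: hybrid-scored exemplars
  llmTieBreak: (q: string, top: Scored[]) => Intent,  // Layer 3: few-shot LLM call
): Intent {
  // Layer 1: deterministic, 0ms, no API call.
  for (const [pattern, intent] of RULES) {
    if (pattern.test(query)) return intent;
  }
  // Layer 2: rank intents by the hybrid embedding + keyword score.
  const ranked = embedScore(query).sort((a, b) => b.score - a.score);
  // Layer 3: only fire the LLM when the top two are within the margin.
  const MARGIN = 0.03;
  if (ranked.length > 1 && ranked[0].score - ranked[1].score < MARGIN) {
    return llmTieBreak(query, ranked.slice(0, 2));
  }
  return ranked[0].intent;
}
```

The key property is that each layer only sees what the cheaper layer above it couldn't resolve.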
The accuracy journey across all five versions:
| Version | Accuracy | What changed |
|---------|----------|--------------|
| v1 (regex-only) | 53% | Initial attempt |
| v2 (+ embeddings) | 64% | Added embedding similarity |
| v3 (+ merged recall/digest) | 72% | Merged overlapping intents |
| v4 (+ rules + exemplars) | 75% | More rules, more exemplars, keyword scoring |
| v5 (full rewrite) | 97.9% | 60+ zero-false-positive rules, 245 exemplars, hybrid 55/45 blend |
The jump from 75% to 97.9% came from a counterintuitive decision: instead of making the embedding layer smarter, we moved everything we could into the rule layer and only sent the genuinely ambiguous cases to embeddings. Rules are deterministic and free. Embeddings are probabilistic and cost an API call. Push the boundary of what rules can handle, and the embedding layer only sees the hard cases it's suited for.
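The 55/45 blend itself is simple to state. A minimal sketch, assuming Jaccard overlap on word sets as the keyword signal (the production keyword metric isn't specified in this post):

```typescript
// Sketch of the hybrid score: 55% cosine similarity, 45% keyword overlap.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Keyword signal: Jaccard overlap on lowercase word sets (an assumption).
function keywordOverlap(query: string, exemplar: string): number {
  const words = (s: string) => new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const q = words(query), e = words(exemplar);
  const shared = [...q].filter((w) => e.has(w)).length;
  const union = new Set([...q, ...e]).size;
  return union === 0 ? 0 : shared / union;
}

function hybridScore(queryVec: number[], queryText: string, exVec: number[], exText: string): number {
  return 0.55 * cosine(queryVec, exVec) + 0.45 * keywordOverlap(queryText, exText);
}
```

The keyword term is what pulls apart queries like "rank by engagement" and "find similar contacts": their embeddings sit close together, but their word sets barely intersect.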
Classification is only the first problem
Knowing the intent is step one. You still need to extract structured information from the query. "CTOs at enterprise companies in healthcare" needs to become:
[
  { field: "job_title", operator: "CONTAINS", value: "CTO" },
  { field: "company_size", operator: "GT", value: 500 },
  { field: "industry", operator: "CONTAINS", value: "healthcare" }
]

This extraction runs in parallel with classification. The classifier detects filter conditions, identity signals (person name, company, email), enrichment depth, sort order, and numeric limits -- all from the natural language query.
The extraction accuracy is honestly mixed. Location extraction hits 100%. Job title is at 73%. Company name is at 42%. Name extraction is at 16%, but that's because most name-based queries are handled by the identity resolution layer rather than property filtering. The system compensates for weak extraction by falling back to semantic search when structured filters don't match enough records.
I'd like the extraction to be better. We're still iterating on it.
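For concreteness, here is a deliberately minimal pattern-based extractor that produces the filter-condition shape shown above. The field names and operators match the example; the title vocabulary, the "enterprise means company_size > 500" heuristic, and the industry list are illustrative assumptions, and the production extractor is far broader:

```typescript
// Toy extractor producing the filter-condition shape from the example.
type Operator = "CONTAINS" | "GT" | "LT";
interface FilterCondition { field: string; operator: Operator; value: string | number; }

function extractFilters(query: string): FilterCondition[] {
  const conditions: FilterCondition[] = [];
  const q = query.toLowerCase();
  // Job-title heuristic: a tiny assumed vocabulary of title tokens.
  const title = q.match(/\b(ctos?|cfos?|ceos?|vps?|directors?)\b/);
  if (title) {
    conditions.push({
      field: "job_title",
      operator: "CONTAINS",
      value: title[1].replace(/s$/, "").toUpperCase(),
    });
  }
  // "enterprise" as a company-size proxy (assumed threshold of 500).
  if (/\benterprise\b/.test(q)) {
    conditions.push({ field: "company_size", operator: "GT", value: 500 });
  }
  // Industry keywords (tiny assumed vocabulary).
  const industry = q.match(/\b(healthcare|fintech|retail)\b/);
  if (industry) {
    conditions.push({ field: "industry", operator: "CONTAINS", value: industry[1] });
  }
  return conditions;
}
```

The gap between this sketch and robust extraction is exactly where the mixed accuracy numbers above come from: fixed vocabularies and regexes degrade fast on open-ended fields like company names.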
Token budgeting: the underappreciated detail
Here's something I didn't anticipate being important: controlling how much data comes back per record.
When an agent asks "how many contacts do I have?" it needs a number. Returning full property sets for 500 contacts is a waste of tokens and money. When the same agent asks "prep me for the meeting with Sarah," it needs everything.
We built five depth levels:
| Depth | Tokens per record | What's included | Auto-triggered by |
|-------|-------------------|-----------------|-------------------|
| ids | ~20 | Record ID only | Count queries |
| labels | ~50 | ID + display label | Browse, filter |
| summary | ~150 | + 3-5 key properties | Search, analyze, segment |
| context | ~500 | + compiled digest + top memories | Compare |
| full | ~1000 | + all properties + all memories | Meeting prep, deep recall |
The depth is auto-detected from the query. "How many contacts?" returns ids depth. "Contacts in Victoria" returns labels. "Prep me for the call with Sarah" returns full. The caller can override it, but in practice auto-detection is right about 61% of the time, and when it's wrong it usually errs toward returning slightly more context than needed rather than less.
The impact is significant. A batch operation across 50 contacts at labels depth costs ~2,500 tokens. The same operation at full depth would cost ~50,000. The system decides which budget is appropriate before executing the query, not after.
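The budgeting logic reduces to a small lookup. The depth names and per-record token costs come from the table above; the intent-to-depth mapping is an illustrative assumption based on the "Auto-triggered by" column:

```typescript
// Sketch of depth auto-detection and token-budget estimation.
type Depth = "ids" | "labels" | "summary" | "context" | "full";

// Per-record token costs from the depth table (approximate).
const TOKENS_PER_RECORD: Record<Depth, number> = {
  ids: 20, labels: 50, summary: 150, context: 500, full: 1000,
};

// Assumed default depth per intent, following the "Auto-triggered by" column.
const DEFAULT_DEPTH: Record<string, Depth> = {
  count: "ids",
  browse: "labels",
  filter: "labels",
  search: "summary",
  analyze: "summary",
  segment: "summary",
  compare: "context",
  recall: "full",
};

function estimateBudget(
  intent: string,
  recordCount: number,
  override?: Depth, // the caller can force a depth
): { depth: Depth; tokens: number } {
  const depth = override ?? DEFAULT_DEPTH[intent] ?? "summary";
  return { depth, tokens: recordCount * TOKENS_PER_RECORD[depth] };
}
```

Run against the example in the text, 50 records at labels depth costs 2,500 estimated tokens versus 50,000 at full -- and the estimate exists before any data is fetched, which is what lets the system pick a budget up front.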
Session context: harder than it looks
We added session support so agents can ask follow-up questions. Pass a session_id, and SmartRecall resolves pronouns against previous results.
// First query
const r1 = await client.smartRecallUnified({
  message: "contacts in Victoria",
  sessionId: "sess-123",
});

// Follow-up -- "them" resolves to the Victoria contacts
const r2 = await client.smartRecallUnified({
  message: "rank them by engagement",
  sessionId: "sess-123",
});

The implementation is an LRU cache with a 5-minute TTL. Simple enough. The hard part was deciding what to cache. We store the record IDs and the classified intent from the previous query, not the full response. This keeps the cache small and avoids stale data -- the follow-up query re-fetches current properties rather than using cached values that might have changed.
Session follow-up accuracy is at 90% in benchmarks. The 10% failures are mostly cases where the pronoun reference is genuinely ambiguous ("compare them" after a query that returned two distinct groups). Still working on this.
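The cache itself is small enough to sketch. This version stores record IDs and the classified intent with a 5-minute TTL, as described above; the class name, max size, and recency-on-read behavior are assumptions for illustration (it leans on the fact that JS `Map` iterates in insertion order, which gives LRU eviction almost for free):

```typescript
// Sketch of the session cache: previous record IDs + intent, LRU + TTL.
interface SessionEntry { recordIds: string[]; intent: string; storedAt: number; }

class SessionCache {
  private entries = new Map<string, SessionEntry>();
  constructor(private maxSize = 1000, private ttlMs = 5 * 60 * 1000) {}

  set(sessionId: string, recordIds: string[], intent: string): void {
    this.entries.delete(sessionId); // re-insert to refresh recency
    this.entries.set(sessionId, { recordIds, intent, storedAt: Date.now() });
    if (this.entries.size > this.maxSize) {
      // Evict the least recently used entry (first key in insertion order).
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }

  get(sessionId: string): SessionEntry | undefined {
    const entry = this.entries.get(sessionId);
    if (!entry) return undefined;
    if (Date.now() - entry.storedAt > this.ttlMs) {
      this.entries.delete(sessionId); // expired past the TTL
      return undefined;
    }
    // Refresh recency on read so active sessions survive eviction.
    this.entries.delete(sessionId);
    this.entries.set(sessionId, entry);
    return entry;
  }
}
```

Storing only IDs and intent is the design choice that matters: the follow-up re-fetches live properties, so a contact edited between the two queries never surfaces stale values.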
What the response actually looks like
A query like "contacts in Victoria who are CTOs" returns:
{
  "plan": {
    "classifiedAs": ["filter"],
    "steps": ["property_filter", "enrich_labels"],
    "confidence": 0.92
  },
  "records": [
    {
      "recordId": "contact-abc123",
      "displayName": "Sarah Chen",
      "email": "sarah@techflow.io",
      "score": 0.95,
      "relevanceTier": "direct",
      "properties": {
        "location": "Victoria",
        "job_title": "CTO"
      },
      "freshness": { "score": 0.85 },
      "completeness": { "score": 0.72 }
    }
  ],
  "warnings": [
    {
      "type": "potential_duplicate",
      "records": ["contact-abc123", "contact-def456"],
      "reason": "2 records share the name 'Sarah Chen'. Consider merging."
    }
  ],
  "meta": {
    "totalMatched": 12,
    "returned": 10,
    "enrichmentDepth": "labels",
    "tokensUsed": 620,
    "latencyMs": 210
  }
}

A few things to notice. The response includes freshness and completeness scores per record, so the consuming agent knows how current and how complete each profile is. The warnings array is proactive -- the system flags potential duplicates even though the caller didn't ask about duplicates. The plan object exposes what the classifier decided, so if something routes incorrectly the caller can see why.
These signals -- freshness, completeness, warnings, suggested actions -- are what separate a retrieval system from an intelligence layer. They turn "here are your results" into "here are your results, and here's what you should know about their quality."
What I'd do differently
If I were starting over, I'd build the rule layer first and ship it. Rules at 0ms with 100% precision on unambiguous queries would have covered the majority of real usage. We spent months on embedding classification before realizing that the cheapest, fastest, most reliable layer should handle the bulk of traffic.
I'd also invest more in extraction accuracy earlier. The classifier knowing the intent is table stakes. The extraction knowing that "enterprise companies" means company_size > 500 is what makes the response actually useful. We built extraction as a secondary concern and it shows in the accuracy numbers.
The hybrid 55/45 scoring blend (cosine similarity vs. keyword overlap) was discovered through experimentation, not theory. I suspect there's a better weighting, and it probably varies by intent category. We haven't explored per-intent blending yet.
The architectural principle
An AI agent should describe what it needs in natural language. The infrastructure should figure out how to get it.

This sounds obvious stated as a principle. In practice, almost every AI memory system exposes retrieval strategy as the API surface -- you call a vector search endpoint, or a graph traversal endpoint, or a key-value lookup. The agent has to know what kind of question it's asking before it can ask the question. That's backwards.
SmartRecall Unified is our attempt at inverting that. One endpoint, natural language in, structured response out, with the system choosing the strategy. The classification is at 97.9%. The extraction needs work. The token budgeting saves more context window space than I expected. The session support mostly works.
It's not done. But it's working in production, and the agents using it no longer need to know the difference between a filter, a count, and a comparison. That was the goal.
SmartRecall Unified is part of the governed memory system at Personize. The implementation details in this post are from the current production deployment. The benchmark datasets and runners are in our experiments repository.