When the information an agent needs is scattered across 3–5 sources, a single retrieval pass misses most of it. Here's what actually works — and the surprising finding about what drives the gain.
TL;DR
- On hard multi-hop queries where information is scattered across 3–5 sources, a single retrieval pass achieves 37.1% completeness. Reflection-bounded retrieval reaches 62.8% — a +25.7 percentage point improvement.
- The surprising finding: query generation strategy matters more than round count. One additional round gets you from 37.1% to 61.2%. A second round only adds 1.6pp more.
- API-managed reflection (+3.3pp) dramatically underperforms manual multi-hop (+25.7pp). The gap is a query decomposition problem, not a retrieval problem.
- The reflection loop is bounded — configurable max rounds (default: 2) with predictable latency per round.
- Reflection is effective when information exists but is scattered. It yields diminishing returns when underlying data is absent.
Standard vector retrieval makes one pass. Query goes in, semantically similar results come out. For most queries, this works well enough. But for a specific category of queries — ones where the information you need is distributed across multiple sources, requires synthesizing facts from different time periods, or depends on understanding how earlier facts qualify later ones — a single pass is structurally insufficient.
Not because the retrieval is bad. Because the question is multi-hop.
The Problem with Multi-Hop Queries
Consider: "What did the prospect's CTO say about vendor evaluation, and how does that relate to the API reliability concerns the support team logged last month?"
This requires:
- Facts from a sales call about vendor evaluation
- Facts from a support ticket about API reliability
- The relationship between those two threads
A single embedding query for this full question will retrieve the most semantically similar chunks — probably some of each, but not necessarily all of each, and not necessarily the bridge between them. The embedding similarity search doesn't know it's looking for a complete picture. It returns what's most similar to the query vector.
The result: partial answers that look complete. The agent has some of what it needs, not all of it, and has no mechanism to detect the gap.
The Reflection Loop
Reflection-bounded retrieval addresses this by adding an iterative check after the initial retrieval pass.
The loop works as follows:
Round 0 (always runs): Standard retrieval — query embedding, vector search with entity-scoped filtering, post-filtering by metadata. Returns an initial result set.
Reflection check: An LLM judges evidence completeness at low temperature (0.1). The judge has the original query and the current result set. It assesses: Is this sufficient to answer the question? If yes, stop. If no, what's missing?
Follow-up query generation: If incomplete, the same LLM generates one to two targeted follow-up queries at moderate temperature (0.3). These queries are designed to retrieve the specific information identified as missing — not to re-ask the original question, but to fill the identified gap.
Round N: Run the follow-up queries through the same retrieval path. Merge results across rounds by identifier (deduplication prevents the same memory appearing twice). Return to the reflection check.
The loop is bounded by a configurable maximum round count (default: 2). Each round adds predictable latency: one LLM call plus zero to two embedding-and-search operations.
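The rounds above can be sketched as a bounded loop. This is a minimal sketch, assuming injectable callables (`vector_search`, `judge_completeness`, `generate_followups`) in place of the system's actual retrieval and LLM APIs, and assuming each result carries an `id` field used for deduplication:

```python
def reflective_retrieve(query, vector_search, judge_completeness,
                        generate_followups, max_rounds=2):
    """Bounded reflection loop: round 0 always runs, extra rounds are
    gated by the completeness judge and capped at max_rounds."""
    # Round 0: standard retrieval, deduplicated by memory identifier.
    results = {r["id"]: r for r in vector_search(query)}

    for _ in range(max_rounds):
        verdict = judge_completeness(query, list(results.values()))
        if verdict["sufficient"]:
            break  # evidence judged complete; stop early
        # Generate at most two targeted follow-up queries for the gap.
        for followup in generate_followups(query, verdict["missing"])[:2]:
            for r in vector_search(followup):
                results.setdefault(r["id"], r)  # merge without duplicates

    return list(results.values())
```

Because results merge by identifier, a memory retrieved in round 0 and again by a follow-up query appears once in the final set.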
What the Numbers Say
We tested this on 10 hard multi-hop queries — questions where the complete answer requires information from 3–5 distinct sources — under four conditions:
| Condition | Avg Completeness | Avg Results | Avg Latency |
|---|---|---|---|
| No reflection (baseline) | 37.1% | 15.0 | 9.4s |
| API-managed, 1 round | 40.4% | 22.5 | 10.4s |
| Manual multi-hop, 1 round | 61.2% | 20.8 | 6.5s |
| Manual multi-hop, 2 rounds | 62.8% | 21.9 | 10.0s |
The result that demands explanation is the API-managed vs. manual multi-hop comparison. API-managed reflection with one round adds only 3.3pp (37.1% → 40.4%). Manual multi-hop with one round adds 24.1pp (37.1% → 61.2%). Same number of rounds. Dramatically different results.
The Real Lever: Query Generation
The gap between 40.4% and 61.2% isn't a retrieval gap. The retrieval mechanism is identical — same vector search, same entity scoping, same filtering. The difference is in how the follow-up queries are generated.
API-managed reflection generates follow-up queries automatically based on the reflection LLM's completeness judgment. The query generation is generic — it identifies what's missing and produces a query to find it, but that query is often too similar to the original to surface substantially different results.
Manual multi-hop decomposes the original question into specific sub-questions before retrieval begins. "What vendor evaluation facts exist for this contact?" and "What API reliability issues were logged for this account?" are better queries than "vendor evaluation and API reliability context" — they're targeted, specific, and designed to retrieve distinct information that can be synthesized.
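The geometry behind this is easy to demonstrate. In a toy embedding space where the two threads are orthogonal topics, a blended query vector sits between them and matches each topic's documents worse than a targeted sub-query does (illustrative only; real embedding spaces are higher-dimensional and less clean):

```python
import numpy as np

# Toy 2-D "embedding space": two orthogonal topics.
vendor_doc = np.array([1.0, 0.0])    # vendor-evaluation content
support_doc = np.array([0.0, 1.0])   # API-reliability content

# A blended query covering both topics lands between them.
blended = vendor_doc + support_doc
blended = blended / np.linalg.norm(blended)

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A decomposed sub-query aimed at one topic matches its documents at 1.0;
# the blended query matches each topic at only ~0.71.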
Query decomposition strategy is the primary lever in reflection-bounded retrieval, not round count. This is also visible in the round count data. One round of manual multi-hop (61.2%) gets within 1.6pp of two rounds (62.8%). Most of the completeness gain materializes in the first additional retrieval pass. The second round adds marginal value — useful for hard queries with deep information scatter, but not the primary driver.
We're actively working on improving the API query generation to close the 21pp gap between API-managed and manual approaches. The architecture is in place; the remaining work is query decomposition quality.
When Reflection Helps and When It Doesn't
Reflection is effective under a specific condition: the information exists in the store but is scattered across sources that a single query doesn't fully surface.
It has diminishing returns when:
- The underlying data isn't there. If the memory store doesn't contain what the query needs, additional retrieval rounds won't find it. Reflection retrieves what exists — it doesn't generate information that isn't there.
- The query is simple and single-hop. A direct question about a single entity fact (budget, role, buying stage) doesn't benefit from reflection. The initial pass surfaces it. Adding a reflection check adds latency for no gain.
- The result set is already complete. The reflection check's purpose is to detect incompleteness. If the initial pass is comprehensive, the check returns "sufficient" and the loop terminates without additional rounds.
In production, most queries are simple and don't trigger the reflection loop. Reflection activates for the hard cases — the queries where incompleteness would actually matter and where additional retrieval rounds have real information to find.
Bounded Latency
The latency profile matters for production deployments. Each reflection round adds: one LLM call (for the completeness check and query generation) plus zero to two embedding-and-search operations.
With default settings (max 2 rounds, ~850ms per round on the retrieval side, ~1–2s for the LLM calls), the worst-case total latency is roughly 10s — as measured in the experiment. For queries that don't need reflection, latency stays near the baseline (9.4s in the experiment, primarily from the initial vector search and synthesis).
This is predictable. The bounded round count ensures reflection can't run indefinitely. Organizations can tune the bound based on their latency tolerance and the average information density of their entity memory stores.
How This Fits the Broader Architecture
Reflection-bounded retrieval is Layer 3 of the four-layer architecture. It operates on top of the dual memory store (Layer 1), which means the reflection loop has both open-set facts and schema-enforced properties available to retrieve from.
The quality gate mechanism in Layer 1 compounds here: facts that enter the store are coreference-resolved, self-contained, and temporally anchored. When the reflection loop generates follow-up queries based on "what's missing," it's querying a clean store where the information that exists is reliable. Reflection on a noisy store would surface noise in addition to the missing signal, reducing the effective gain.
Frequently Asked Questions
How does the completeness check work? Isn't it just another LLM call? Yes, it's an LLM call, but at low temperature (0.1) and with a narrow, binary task: assess whether the retrieved evidence is sufficient to answer the original question, and if not, identify what's missing. This is a different task than answer generation — it's evaluation, not synthesis. Low temperature keeps the check consistent and repeatable rather than creative.
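The shape of that judge call might look like the following. The prompt wording and constants below are illustrative, not the system's actual prompt; only the temperature values (0.1 for the judge, 0.3 for follow-up generation) come from the description above:

```python
# Temperatures mirror the settings described above; the prompt text
# itself is a hypothetical sketch of an evaluation-not-synthesis task.
JUDGE_TEMPERATURE = 0.1
FOLLOWUP_TEMPERATURE = 0.3

JUDGE_PROMPT = """You are evaluating retrieved evidence, not answering the question.

Question: {question}

Retrieved evidence:
{evidence}

Is this evidence sufficient to fully answer the question?
Respond with JSON: {{"sufficient": true|false, "missing": "<what is absent>"}}"""
```

Constraining the output to a small JSON verdict keeps the call cheap to parse and hard to drift into synthesis.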
Can reflection introduce hallucination? The reflection loop generates queries, not answers. Hallucination in the query generation phase produces follow-up queries that don't retrieve useful results — the loop finds nothing additional and terminates. It doesn't introduce false information into the result set, because results are retrieved, not generated.
What's the tradeoff between max rounds and quality? The data shows most gains come in the first additional round. Setting max rounds to 1 captures ~94% of the two-round completeness gain (61.2% vs. 62.8%) at lower latency. For high-stakes queries where completeness is paramount, two rounds is appropriate. For general production use, one additional round is usually the right balance.
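The ~94% figure follows directly from the completeness numbers in the table above:

```python
# Share of the two-round completeness gain captured by a single round,
# using the measured completeness values from the experiment table.
baseline = 37.1
one_round = 61.2
two_rounds = 62.8

share = (one_round - baseline) / (two_rounds - baseline)
# 24.1pp / 25.7pp ≈ 0.94: one round captures ~94% of the gain
```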
Does reflection work across both open-set and schema-enforced memory? Yes. The retrieval path operates across both memory types simultaneously. Follow-up queries can surface open-set facts, schema-enforced properties, or both, depending on what the reflection check identifies as missing.