Enterprises have more knowledge than any single person can read. We want AI agents to read all of it and answer anything. But the gap between what we want and what's technically possible is where the real problem lives.
The Dream
Imagine asking your AI: "What's our refund policy for enterprise customers on annual contracts?"
The response you want is an answer, grounded in your company's actual policies, with reasoning about the specific case you mentioned.
The response you often get is something like: "The refund policy can be found on page 493 of refund-policy-v4.pdf in your SharePoint."
That's not helpful. We didn't want a pointer. We wanted the AI to read the file and answer.
But the dream is bigger than that. Finding the right file is just the first step. The AI also has to resolve the contradictions it finds: two policies that disagree, a draft that conflicts with the approved version, a 2022 memo that technically still lives in the wiki but has been superseded twice. Then it has to produce a single concise answer that is actually correct for this question.
And the bar shifts with the query. "What's our office address?" is a lookup. "Can I issue this $50K refund?" is a decision with legal exposure. Two queries, same system, but they demand different levels of caution, different sourcing discipline, different tolerance for ambiguity. The right answer to a low-stakes question is fast. The right answer to a high-stakes question is careful, sourced, and sometimes "I'm not sure, here's why, escalate to a human."
The promise of enterprise AI (that it would absorb all our knowledge and reason over it) keeps running into a wall that most people don't understand until they try to build it. Retrieval is the wall everyone hits first. Reconciliation and calibrated answering are the walls behind it.
Why AI Can't Just "Read Everything"
When you connect an AI to your SharePoint, your Confluence, your Google Drive, your internal wiki, the AI doesn't suddenly know what's in those systems. It has access, but it can't ingest all of it at once. Not today. Probably not ever.
The constraint is something called a context window. Every language model has a strict limit on how much text it can consider at one time. Even the most generous models cap at around a book's worth of words -- and that limit includes the AI's own reasoning and response, so you can never fill it entirely. A real enterprise knowledge base is easily a thousand times larger than any context window.
Beyond the hard limit, there are practical problems too. Sending massive content with every query is expensive -- token costs scale linearly with input size. It's slow -- large inputs mean slower responses. And it's wasteful -- most of what you're sending is irrelevant to the specific question.
So the question becomes: given that the AI can only see a small window of text at a time, how do we make sure the right text is in that window?
The Common Solution: RAG
The dominant answer for the past few years has been Retrieval-Augmented Generation, usually called RAG.
The idea is elegant. Take your large document corpus and chunk it into small pieces -- maybe paragraphs, maybe pages. For each chunk, compute a numerical representation called an embedding. You can think of an embedding as a fingerprint that captures the meaning of that chunk in a multi-dimensional space. Two chunks with similar meanings will have similar fingerprints.
When someone asks a question, you compute the embedding of the question and find the chunks whose fingerprints are closest to it. Those "nearest neighbors" are what you feed into the context window alongside the original question. The AI reads those chunks and answers.
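The retrieval step described above can be sketched in a few lines. This is a toy illustration, not a production implementation: real systems call an embedding model and use an approximate-nearest-neighbor index, while here the vectors are hand-made stand-ins and the search is brute force.

```typescript
// A chunk of text paired with its embedding "fingerprint".
type Chunk = { text: string; embedding: number[] };

// Cosine similarity: how close two fingerprints are in meaning-space.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the k chunks whose embeddings are closest to the query's.
function nearestChunks(queryEmbedding: number[], chunks: Chunk[], k: number): Chunk[] {
  return [...chunks]
    .sort((x, y) => cosine(queryEmbedding, y.embedding) - cosine(queryEmbedding, x.embedding))
    .slice(0, k);
}
```

The chunks that `nearestChunks` returns are what get packed into the context window alongside the question.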
This works remarkably well for simple question-answering. "What's our vacation policy?" retrieves the vacation section of the employee handbook. "Who is the CFO?" retrieves the leadership page. You don't need to send the entire knowledge base -- just the relevant chunks.
Why RAG Isn't Enough
When you move from simple questions to the kind of work enterprises actually need -- compliance-aware actions, multi-step procedures, reasoning across related documents -- RAG starts to struggle in ways that aren't obvious until you've deployed it.
Chunking destroys structure
Documents aren't just collections of words. A compliance policy has a scope, a set of constraints, and exceptions. A sales playbook has a sequence of stages. A runbook has ordered steps. When you chunk these documents, you split them mid-structure. A paragraph about "high-risk customers" no longer carries the context of what makes a customer high-risk, because that definition was three pages earlier and got chunked separately.
The AI retrieves a chunk that looks relevant but has lost its anchoring context. It confidently answers based on a fragment that doesn't mean what the AI thinks it means.
Similarity is not relevance
Vector similarity tells you that two pieces of text are semantically close. It doesn't tell you that one actually answers the other. A customer asking "how do we handle refunds?" might retrieve a chunk about email templates for refund requests, because the words overlap. The actual refund policy -- the document that tells the AI what to do -- might live in a different file with different vocabulary and score lower than a superficially similar but operationally useless chunk.
This is the single most underappreciated failure mode in production RAG systems. Teams celebrate high recall metrics while the system quietly returns plausible-looking but operationally wrong context.
Intent is invisible
A document mentioning "enterprise pricing" doesn't tell the AI whether it's our current pricing strategy, an old pricing decision that's been deprecated, a competitor's pricing we analyzed, or a hypothetical pricing exercise from a strategy offsite. Flat chunks strip out metadata about what kind of document this is and what the AI should do with it.
For simple question answering, this might not matter. For an AI agent taking action -- drafting an email, updating a CRM record, making a refund -- it matters a lot. The difference between "follow this rule" and "this is background context you should know about" is the difference between a policy violation and a correct action.
What We've Been Building
I've been reading the literature in this area for months. The problems with RAG aren't secrets -- they're openly discussed in papers from Databricks, Microsoft, and the broader information retrieval community. The consensus direction is often called "context engineering": treating retrieval not as a search problem but as a knowledge delivery problem, with structure, governance, and reasoning baked in.
We've been designing a system along these lines. Here's how we've approached it.
Principle 1: Documents have types, and types matter
A policy is not a playbook. A playbook is not a reference document. A reference document is not a template. Each of these serves a different purpose and should be treated differently by the AI.
In our system, every knowledge document has an explicit type:
A guideline is an enforceable rule. "API responses must use camelCase." "Don't send marketing emails to customers who have opted out." When the AI sees a guideline, it understands this is a constraint it must follow, not a suggestion.
A playbook is a process. "How to onboard a new enterprise customer." The AI reads it as an ordered sequence of steps to follow, not a static reference.
A reference is factual knowledge. "Our supported payment methods by region." The AI reads it on demand, like looking something up, not as something it must actively apply.
A template is an output scaffold. "Cold outbound email structure." The AI uses it to shape what it produces.
A brief is context. "Account dossier for Acme Corp." The AI reads it to understand the situation, not to follow instructions.
Same words, different behavior. A paragraph that reads "refunds over $500 require manager approval" behaves like a hard constraint when typed as a guideline and like background information when typed as a brief. The structure tells the AI how to interpret the content.
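The five types above can be captured in a small data model. This is a sketch with illustrative names, not the actual schema: the point is that the type, not the text, tells the agent how to treat the content.

```typescript
// The five document types described above.
type DocType = "guideline" | "playbook" | "reference" | "template" | "brief";

interface KnowledgeDoc {
  title: string;
  type: DocType;
  body: string;
}

// How an agent should treat a document of each type.
function interpretation(type: DocType): string {
  switch (type) {
    case "guideline": return "hard constraint: must be followed";
    case "playbook":  return "ordered steps: execute in sequence";
    case "reference": return "factual lookup: consult on demand";
    case "template":  return "output scaffold: shape the response";
    case "brief":     return "background context: inform, not instruct";
  }
}
```

The same paragraph about refund approvals would flow through `interpretation("guideline")` in one document and `interpretation("brief")` in another, producing different agent behavior from identical words.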
Principle 2: Scoring must combine similarity, vocabulary, and importance
Pure vector similarity is not enough. We use a hybrid scoring formula:
80% embedding similarity -- does the meaning of this document match the meaning of the question? This is what RAG traditionally uses.
20% keyword overlap -- does the exact vocabulary match? Technical terms, product names, specific policy identifiers often matter more than semantic similarity. A query mentioning "SOX compliance" should prefer a document explicitly about SOX over one that vaguely mentions "financial controls."
Plus governance boosts -- certain documents are marked as always-on (they should surface for every query in their domain, regardless of similarity). Others have trigger keywords that boost their ranking when those keywords appear in the query. An "always-on" brand voice guideline adds +0.15 to its score. Each matched trigger keyword adds +0.04, with the total trigger boost capped at +0.20.
This combination lets you express things that pure similarity can't. "This compliance policy must surface whenever anything financial is being discussed." "This brand voice guideline must surface for every customer-facing output." Meaning matters, but so does exact vocabulary, and so does explicit importance.
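The scoring formula above is simple enough to write down directly. A minimal sketch, assuming the field names (the actual schema will differ) and the weights stated in this section:

```typescript
interface ScoredDoc {
  similarity: number;      // embedding similarity, in [0, 1]
  keywordOverlap: number;  // fraction of query terms matched, in [0, 1]
  alwaysOn: boolean;       // governance flag: surface for every query in its domain
  triggerHits: number;     // trigger keywords present in the query
}

// 80% meaning, 20% vocabulary, plus governance boosts.
function hybridScore(doc: ScoredDoc): number {
  const base = 0.8 * doc.similarity + 0.2 * doc.keywordOverlap;
  const alwaysOnBoost = doc.alwaysOn ? 0.15 : 0;
  const triggerBoost = Math.min(doc.triggerHits * 0.04, 0.2); // capped at +0.20
  return base + alwaysOnBoost + triggerBoost;
}
```

An always-on document with several trigger hits can outrank a document with higher raw similarity, which is exactly the behavior pure vector search cannot express.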
Principle 3: Deliver sections, not whole documents
A 3,000-word compliance policy usually has one section that answers a specific question. Retrieving the whole document wastes tokens on irrelevant sections and dilutes the AI's attention.
In our system, every ## heading in a document is independently routable. The scoring runs at both the document level and the section level. If a document scores 0.7 overall but one specific section scores 0.9 for the current query, we deliver just that section with enough surrounding context to preserve meaning.
This is the single biggest source of efficiency gains. We've measured 50-80% token savings from section-level delivery compared to full-document retrieval, with higher relevance because the AI isn't distracted by irrelevant sections.
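Making every ## heading routable starts with splitting documents at those headings. A simplified sketch of that step (the real system also carries surrounding context and runs the scorer per section):

```typescript
interface Section { heading: string; body: string }

// Split a markdown document into independently routable sections,
// one per "## " heading. Text before the first heading is ignored
// here for brevity.
function splitSections(markdown: string): Section[] {
  const sections: Section[] = [];
  let current: Section | null = null;
  for (const line of markdown.split("\n")) {
    if (line.startsWith("## ")) {
      if (current) sections.push(current);
      current = { heading: line.slice(3).trim(), body: "" };
    } else if (current) {
      current.body += line + "\n";
    }
  }
  if (current) sections.push(current);
  return sections;
}
```

Each returned section can then be scored against the query on its own, so a 0.9 section inside a 0.7 document gets delivered without its siblings.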
Principle 4: Token budget is a constraint, not an afterthought
Every query has a finite token budget for context -- we default to 10,000 tokens, but organizations can tune it. When the retrieval system returns candidates whose combined size exceeds the budget, lower-scoring content gets demoted or dropped entirely.
This forces the system to make hard choices: what's most important for this specific query? It prevents the common RAG failure mode where the context window fills with mediocre-but-cheap-to-retrieve content, leaving the AI to reason through noise.
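Budget enforcement can be as simple as a greedy pass over the ranked candidates. A sketch, assuming token counts are precomputed per candidate and using the 10,000-token default mentioned above:

```typescript
interface Candidate { id: string; score: number; tokens: number }

// Keep the highest-scoring candidates that fit; drop the rest.
function fitToBudget(candidates: Candidate[], budget = 10_000): Candidate[] {
  const kept: Candidate[] = [];
  let used = 0;
  for (const c of [...candidates].sort((a, b) => b.score - a.score)) {
    if (used + c.tokens <= budget) {
      kept.push(c);
      used += c.tokens;
    }
    // Candidates that don't fit are dropped, even if a cheaper,
    // lower-scoring one would have squeezed in first.
  }
  return kept;
}
```

Greedy selection is a deliberate simplification here; a production system might demote content to summaries rather than drop it outright.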
Principle 5: Modes for different reasoning needs
Not every query needs the same level of analysis. For simple lookups -- "what's the escalation path for billing issues?" -- embedding similarity is fast and accurate. For complex cross-document reasoning -- "given this customer's history and our enterprise pricing framework, what should our response be?" -- pure similarity isn't enough. The system needs to actually reason about relationships.
We support three modes:
Fast mode (~200ms): embedding-only retrieval. Good for real-time agent queries.
Deep mode (2-5 seconds): LLM-assisted analysis on top of embeddings. The system reads candidate documents and reasons about which actually address the query. Slower, more expensive, but substantially more accurate for complex questions.
Auto mode: the system decides based on query complexity signals. Simple lookups get fast mode. Analytical queries trigger deep mode.
This lets organizations trade off speed, cost, and accuracy per query instead of choosing one setting for everything.
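A router for auto mode only needs cheap complexity signals. The signals below (analytical phrasing, query length) are invented examples of the kind of heuristic such a router might use, not the actual ones:

```typescript
type Mode = "fast" | "deep";

// Route a query to fast or deep mode based on rough complexity signals.
function chooseMode(query: string): Mode {
  // Analytical phrasing suggests cross-document reasoning is needed.
  const analytical = /\b(why|compare|given|should|trade-?off|history)\b/i.test(query);
  // Long queries tend to carry multi-part, multi-entity questions.
  const long = query.split(/\s+/).length > 20;
  return analytical || long ? "deep" : "fast";
}
```

A real router would likely also weigh the candidate set itself -- e.g. escalating to deep mode when top candidates disagree or score close together.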
Principle 6: Structure preservation through attachments
A knowledge document often needs supporting materials: a code template, a schema definition, a chart, a screenshot. RAG systems typically either ignore these or chunk them into oblivion.
In our system, a document can have typed attachments -- scripts, templates, configs, schemas, references, images. The attachments stay linked to their parent document. When the AI retrieves the document, it knows what supporting materials are available and can fetch them on demand.
This matters more than it sounds. A compliance policy that references a specific checklist is only useful if the AI can actually pull the checklist. A playbook that references a template is only useful if the template is accessible. Keeping the links intact is what turns a flat document library into a navigable knowledge system.
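The structure is small: attachments carry a kind and stay on the parent document, so retrieval surfaces a manifest of what can be fetched next. A sketch with illustrative field names:

```typescript
// The attachment kinds described above.
type AttachmentKind = "script" | "template" | "config" | "schema" | "reference" | "image";

interface Attachment { kind: AttachmentKind; name: string; uri: string }

interface DocWithAttachments {
  title: string;
  body: string;
  attachments: Attachment[];
}

// What the agent sees at retrieval time: the document plus a manifest
// of supporting materials it can fetch on demand.
function manifest(doc: DocWithAttachments): string[] {
  return doc.attachments.map(a => `${a.kind}: ${a.name}`);
}
```

Because the link travels with the document rather than being chunked away, a policy that says "complete the audit checklist" arrives alongside a pointer to the checklist itself.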
A Concrete Example
We tested this approach on our own engineering codebase -- 1,322 TypeScript files totaling about 1.5 million tokens of source code and documentation.
We wrote 29 short architecture documents, one per module. Each document was a reference type, with sections for purpose, key files, routes, dependencies, and danger zones. Total size: about 27,000 tokens. That's a compression ratio of roughly 57x.
Then we ran 20 real engineering queries through three approaches:
Raw search -- the baseline where an AI agent greps the codebase blindly -- used about 30,000 tokens per query, reading mostly wrong files.
A knowledge graph approach (using the popular Graphify tool) used only 594 tokens per query, but captured only 42% of the relevant technical terms in its responses. The graph knew where things were connected but not why.
Our structured context approach used about 1,550 tokens per query with 81% keyword relevance -- nearly twice the relevance of the graph approach at a third of the cost of raw search. When we accounted for the full workflow (retrieval plus the follow-up file reads the agent still needs), structured context won on total cost.
The raw numbers aren't what's interesting. What's interesting is why: because the structured context told the agent exactly which two files to read, the agent didn't waste tokens on exploration. Graph approaches tell you where things connect but not what to do. Search tells you where things are but not which ones matter. Structured context tells you both, because it was written by people who know.
Why This Matters for Enterprise Data
The experiment was on code, but the underlying pattern applies to any complex enterprise knowledge base.
Financial services firms have compliance policies that must be followed before any customer-facing action. Healthcare has HIPAA-governed patient data that must be treated with specific protocols. Sales organizations have playbooks that define engagement stages. Engineering has architecture decisions that explain why systems were built the way they were. Operations has runbooks that guide incident response.
None of this is searchable the way RAG imagines. A vector search over compliance policies will return the chunk whose words most resemble the query. It won't necessarily return the policy that governs the specific action the AI is about to take.
The pattern we've been developing -- typed documents, hybrid scoring with governance boosts, section-level delivery, token-budget awareness, and mode-adaptive reasoning -- is designed for exactly this. It treats enterprise knowledge as a structured, governed system rather than a flat search corpus.
The promise isn't that AI will read all your files. It's that AI will navigate your knowledge the way your most experienced employee does: knowing what kind of document addresses a question, which section within that document to focus on, which related materials matter, and when to think more carefully before acting.
Beyond Retrieval: The Reconciliation Problem
Everything above is about getting the right material into the context window. That's necessary, but it isn't sufficient. Once the right chunks are in front of the model, a second, harder problem starts.
Real enterprise knowledge bases contradict themselves. The same question often has three answers sitting in three documents: the one legal approved last quarter, the one a PM drafted two weeks ago, and the one someone wrote in a wiki page in 2022 that nobody remembered to delete. A well-behaved retriever will happily return all three. The AI then has to decide which one is the answer.
That decision is not a retrieval problem. It's a judgment problem, and it depends on dimensions that flat similarity scores don't capture:
Freshness. A document updated yesterday usually beats one updated in 2022, but not always. A foundational policy from 2019 may still be authoritative; a hastily updated draft from yesterday may not be. The system needs to know which documents are time-sensitive and which aren't.
Source authority. A policy published by Legal outranks a how-to written by an intern. A signed-off runbook outranks a Slack thread that got pasted into Confluence. Who wrote it, who approved it, and under what process all matter more than vector distance.
Approval state. Draft, in-review, approved, deprecated, archived. An AI answering from a draft is quietly dangerous. An AI answering from a deprecated document is worse. Retrieval systems that don't model lifecycle state will confidently cite documents that a human would know not to trust.
Query stakes. The consequences of being wrong vary enormously across queries. Giving the wrong office address is embarrassing. Giving the wrong refund authorization is expensive. Giving the wrong HIPAA disclosure answer is a lawsuit. The system needs to calibrate its own confidence to the cost of being wrong, and sometimes the right behavior is to refuse to answer and escalate.
Sensitivity. Some queries touch regulated data, confidential strategy, or personal information. The right answer to a sensitive query isn't just "correct" -- it's correct and appropriately scoped to who's asking and why.
We've solved parts of this. Typed documents carry approval state. Governance boosts let authoritative sources outrank similar-but-unofficial ones. Deep mode gives the system a chance to notice contradictions before answering. But honestly, a lot of this is still open. Detecting genuine contradictions across documents, deciding when to refuse, producing an answer that's concise without being dangerously lossy, calibrating tone to query stakes -- none of that has a clean solution yet.
The reason I'm writing this is that I think the industry has been optimizing the wrong layer. A lot of effort is going into better embeddings, better chunking, better re-rankers. Those help. But the ceiling on enterprise AI usefulness isn't retrieval quality. It's whether the system knows what it's looking at once it has it, and whether it answers with the right level of caution for what's being asked.
What's Next
We've built the core system and validated it on our own codebase. The results are promising enough that we're now running it across more complex document types: governance policies, sales playbooks, customer context documents. Early signals suggest the pattern transfers -- the same principles that helped our engineering agents navigate TypeScript modules help our sales agents navigate account histories and playbooks.
We're designing a more rigorous evaluation for publication: a three-arm experiment with human grading, multiple repetitions, statistical tests. The preliminary numbers are strong, but strong numbers from a single repo with automated scoring don't yet amount to a scientifically defensible claim.
What I'm more interested in is what this approach opens up. If AI agents can reliably navigate structured enterprise knowledge, a lot of work that currently requires a senior employee's judgment becomes automatable -- not by replacing the employee, but by extending their reach. An agent that actually understands your compliance landscape, your sales playbook, your incident response procedures, and your customer context can handle routine work that's currently stuck waiting for human attention.
That's the promise. We're still early. But the gap between "search retrieves chunks that look similar" and "the agent knows how to navigate our knowledge" is where the interesting work is right now.
References
- Enterprise AI Knowledge Management 2026: Microsoft's Shift to Governed Agent Workflows
- From RAG to Context -- 2025 Year-End Review
- A Systematic Framework for Enterprise Knowledge Retrieval -- IEEE CAI 2026
- Databricks: Memory Scaling for AI Agents
- The Enterprise AI Stack in 2026: Models, Agents, and Infrastructure
- Enterprise RAG: Building an AI Knowledge Base in 2026
- The Next Frontier of RAG: Enterprise Knowledge Systems 2026-2030