AI coding agents burn ~30,000 tokens grepping blindly through a codebase to fix a single bug. They shouldn't have to.
TL;DR
- Compressed 1.5M tokens of source into 27K tokens of curated module docs (a 57x ratio), then benchmarked against Graphify and raw grep on 20 engineering tasks
- Raw search (grep + read) costs ~30,000 tokens per query and reads mostly wrong files
- Graphify's knowledge graph costs 594 tokens per query but only surfaces 42% of relevant technical terms
- Curated module references cost 1,550 tokens per query with 81% keyword relevance
- When you account for the full workflow (context + follow-up reads), curated references win on total cost
- The real insight: these approaches aren't competitors, they're complementary -- and the pattern extends far beyond code
The Problem Every AI Agent Has
Last week, Graphify went viral -- 6,000 GitHub stars in 48 hours, inspired by Andrej Karpathy's knowledge base workflow. One command, any folder, full knowledge graph. I installed it immediately.
But as I watched it build a graph of our production codebase (an agent execution platform on AWS Fargate: TypeScript, Express, DynamoDB, LanceDB, ~130 Lambda functions), I started wondering: does structural knowledge actually help an agent fix a bug?
When I tell Claude Code "the recall endpoint returns stale properties after memorize," what does it do? It greps for "recall" across 1,322 files, reads the top 10 matches, and spends ~30,000 tokens scanning through mostly irrelevant code. It eventually finds the answer, but the process is expensive and wasteful.
The agent doesn't need to read 10 files. It needs to know three things:
- Recall reads from DynamoDB (not LanceDB)
- Memorize writes to LanceDB first, then issues a fire-and-forget write to DynamoDB
- There's a record-ID cache that every write endpoint must invalidate
That's 50 words of architectural knowledge. The agent is spending 30,000 tokens to discover what could be communicated in about 200 tokens.
The 57x Compression
I wrote 29 short architecture documents -- one per module in our codebase. Each document explains:
- Purpose: What the module does in one paragraph
- Key files: Every important file with a one-line description
- Routes: All API endpoints with HTTP methods and handlers
- Dependencies: What this module imports from and what imports it
- Danger zones: What breaks if you change this, what never to modify
No source code. Just structural knowledge with file path references. The whole set is 27,000 tokens covering 1.5 million tokens of source -- a 57x compression ratio.
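The per-module doc structure described above can be sketched as a simple schema. The field names and the example entry below are illustrative (drawn from details mentioned in this post), not the author's actual format:

```python
from dataclasses import dataclass

@dataclass
class ModuleDoc:
    """One curated architecture doc per module; fields mirror the sections above."""
    name: str
    purpose: str                # one-paragraph summary
    key_files: dict[str, str]   # file path -> one-line description
    routes: list[str]           # "METHOD /path -> handler"
    dependencies: list[str]     # what this module imports from / what imports it
    danger_zones: list[str]     # invariants that must never be broken

# Illustrative entry for the memory module described in this post
memory = ModuleDoc(
    name="memory",
    purpose="Recall reads from DynamoDB; memorize writes LanceDB first, "
            "then DynamoDB fire-and-forget.",
    key_files={
        "recall.service.ts": "Reads properties from DynamoDB",
        "memorize.service.ts": "Dual-write: LanceDB (blocking), DynamoDB (async)",
    },
    routes=["POST /memorize -> memorize.service", "GET /recall -> recall.service"],
    dependencies=["vector-store", "dynamo-client"],
    danger_zones=["Every write endpoint must invalidate the record-ID cache"],
)
```

A doc in this shape stays small (a few hundred tokens) while carrying exactly the facts an agent needs before it opens any source file.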
I uploaded these to our SmartContext system, which uses embedding similarity + section-level routing to surface the right module doc when an agent asks a question. Then I ran the same 20 queries through all three approaches.
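The routing step can be sketched as similarity search over the module docs. This is a toy sketch: the bag-of-words `embed` below stands in for a real embedding model, and the scoring is plain cosine similarity, not SmartContext's actual section-level routing:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words stand-in for a real embedding model
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def route(query: str, module_docs: dict[str, str], k: int = 1) -> list[str]:
    """Return the k module docs most similar to the query."""
    q = embed(query)
    ranked = sorted(module_docs,
                    key=lambda m: cosine(q, embed(module_docs[m])),
                    reverse=True)
    return ranked[:k]

docs = {
    "memory": "recall memorize DynamoDB LanceDB record-ID cache stale properties",
    "executor": "client-side tool calls agent execution Fargate lambda",
}
print(route("recall endpoint returns stale properties after memorize", docs))
# -> ['memory']
```

Because each doc front-loads its domain keywords, even this crude similarity measure routes the stale-data query to the memory module rather than the executor.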
The Benchmark
I designed 20 engineering tasks across four categories that real developers encounter daily:
Bug localization (5 queries): "The recall endpoint returns stale data after memorize" -- find the root cause.
Feature understanding (5 queries): "How does the executor handle client-side tool calls?" -- explain the architecture.
Code modification (5 queries): "Add a new memorize tier called enterprise" -- identify the right files to change.
Impact analysis (5 queries): "What breaks if we rename LanceDBVectorStore?" -- map the blast radius.
Each query ran through all three approaches, measuring tokens consumed, keyword relevance (did the response contain the right technical terms?), and whether the agent was pointed to the correct module.
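The keyword-relevance metric can be illustrated as the fraction of expected technical terms that show up in a response. This is a hypothetical scoring function for illustration; the actual benchmark harness may differ:

```python
def keyword_relevance(response: str, expected_terms: list[str]) -> float:
    """Fraction of expected technical terms present in the agent's response."""
    text = response.lower()
    hits = sum(1 for term in expected_terms if term.lower() in text)
    return hits / len(expected_terms)

response = ("Memorize writes to LanceDB first, then DynamoDB fire-and-forget; "
            "the record-ID cache must be invalidated on every write.")
# Expected terms are illustrative; "SQS queue" is a deliberate miss
terms = ["LanceDB", "DynamoDB", "fire-and-forget", "record-ID cache", "SQS queue"]
print(keyword_relevance(response, terms))  # 4 of 5 terms present -> 0.8
```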
The Results
Raw Numbers
| Metric | Raw Search | Graphify | Curated Refs |
|--------|-----------|----------|--------------|
| Avg tokens/query | 29,956 | 594 | 1,550 |
| Total (20 queries) | 599,128 | 91,889* | 31,009 |
| Keyword relevance | 100% | 42% | 81% |
| Module accuracy | 95% | N/A | 85% |
*Graphify total includes ~80K tokens for the one-time graph build via LLM extraction.
Graphify wins on per-query token cost by a wide margin. But here's what the raw numbers don't capture.
What Each Approach Actually Returns
Raw search gives you file contents. The agent reads recall.service.ts, memorize.service.ts, and 8 other files. It has the answer somewhere in 30,000 tokens of code -- but it has to figure out which parts matter.
Graphify gives you node names and edges: "MemoryController has 41 edges. LanceDBVectorStore has 45 edges. MemoryController calls recallPro." Structurally correct, but the agent still doesn't know about the dual-write pattern or the cache invalidation requirement. It needs to read the source files next.
Curated references give you architectural context: "Memorize writes to LanceDB (blocking) then DynamoDB (fire-and-forget). ALL 5 write endpoints must bust the record-ID cache. Key files: recall.service.ts, memorize.service.ts." The agent knows exactly which 2 files to open and what to look for.
The Full Workflow Cost
This is the number that matters. After getting context, the agent still needs to read source files to make changes. How many?
| Approach | Context cost | Follow-up reads | Total | Accuracy |
|----------|-------------|-----------------|-------|----------|
| Raw search | 29,956 | 0 (already read) | ~30,000 | 95% |
| Graphify + 3 files | 594 | ~9,000 | ~9,600 | ~60% |
| Curated refs + 2 files | 1,550 | ~6,000 | ~7,550 | 85% |
Curated references win on total workflow cost because the agent reads 2 targeted files instead of 3 uncertain ones.
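The full-workflow arithmetic is worth making explicit. The sketch below reproduces the totals from the article's figures, assuming roughly 3,000 tokens per source file read (an assumption consistent with ~9,000 tokens for 3 files and ~6,000 for 2):

```python
# Context cost + follow-up file reads, at an assumed ~3,000 tokens per file
workflows = {
    "raw search":   {"context": 29_956, "follow_up": 0},            # files already read
    "graphify":     {"context": 594,    "follow_up": 3 * 3_000},    # 3 uncertain files
    "curated refs": {"context": 1_550,  "follow_up": 2 * 3_000},    # 2 targeted files
}
totals = {name: w["context"] + w["follow_up"] for name, w in workflows.items()}
print(totals)
# {'raw search': 29956, 'graphify': 9594, 'curated refs': 7550}
```

Graphify's cheap per-query context is mostly eaten back by the extra uncertain file read; the curated docs' 1,550-token context pays for itself by eliminating it.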
Per-Category Breakdown
The most interesting finding was how performance varied by task type.
Code modification was our best category: 5/5 perfect routing. When the query was "add a new memorize tier" or "add HubSpot sync for deals," SmartContext routed to exactly the right module every time. These queries contain specific domain terms that match the module docs well.
Impact analysis was our weakest: 2/5 exact matches. "What breaks if we change the DynamoDB MEMBER key pattern?" requires understanding cross-module dependencies. A graph structure is genuinely better at answering "what connects to what" -- that's what graphs are designed for.
What I Learned
1. Graphs capture structure. Curated docs capture intent.
Graphify correctly identified that LanceDBVectorStore has 45 edges and is the most connected node in the memory module. But it couldn't tell the agent that the dual-write is fire-and-forget, that DynamoDB may lag behind LanceDB, or that changing the embedding model invalidates all stored vectors.
These are design decisions, not structural facts. They live in developers' heads, in architecture decision records, in comments that say "// NOTE: this is intentional." A syntax tree can parse the code; it can't parse the reasoning behind the code.
2. Most enterprise knowledge isn't code
Our codebase is 1,322 TypeScript files. But it also contains 595 documentation files, 23 governance guidelines, 15 platform playbooks, 27 Terraform modules, and 80+ reference guides. Taken together, these non-code files make up more than a third of the repository.
This ratio is typical for production enterprise systems. Microsoft's 2026 enterprise AI roadmap describes a shift from information retrieval to governed agent workflows -- systems that don't just find knowledge but apply it with accountability. Researchers at RAGFlow describe RAG evolving from "Retrieval-Augmented Generation" into a "Context Engine" -- intelligent retrieval as the core infrastructure for enterprise AI.
When you step outside the codebase, the pattern becomes even clearer. Enterprises have:
- Compliance policies that agents must follow before taking action
- Sales playbooks that define how to engage prospects at each stage
- Onboarding procedures that new team members (human or AI) must learn
- Architecture decision records that explain why, not just what
- Customer context that must persist across conversations
None of this is code. None of it lives in a syntax tree. All of it is critical for an agent to operate effectively.
3. The right approach depends on the question
| Question type | Best approach | Why |
|--------------|--------------|-----|
| "Where do I add this feature?" | Curated module docs | Points to exact files and routes |
| "What imports this class?" | Knowledge graph | Structural traversal is perfect for this |
| "Why does this module work this way?" | Curated docs | Design intent can't be parsed |
| "What's the blast radius of this change?" | Graph + curated docs | Graph for connections, docs for constraints |
| "How do I debug this failure?" | Curated docs + raw search | Docs for orientation, source for details |
The strongest setup would combine a knowledge graph for structural queries with curated docs for architectural context -- using each where it excels.
Beyond Code: Where This Pattern Opens Up
The experiment was about codebase navigation, but the underlying pattern -- structured context documents, routed by semantic similarity, delivered within a token budget -- applies anywhere AI agents need to navigate complex knowledge.
Regulated Industries
In healthcare, financial services, and legal, AI agents can't just search for answers. They need to follow specific procedures, cite specific policies, and stay within compliance boundaries. A curated context document that says "For HIPAA-covered entities, patient data MUST be redacted before LLM processing -- see policy §4.2" is fundamentally different from a vector search hit that returns a paragraph mentioning HIPAA.
Enterprise Onboarding
When a new engineer joins a team, they spend weeks learning which modules are dangerous, which patterns are preferred, which decisions were made and why. The same is true for AI agents. A curated module reference is essentially an onboarding document that an agent can consume in 1,550 tokens instead of 30,000.
Customer-Facing Agents
Sales agents need playbooks. Support agents need escalation procedures. Account managers need relationship context. In each case, the agent doesn't need to search through every document in the knowledge base -- it needs the right 2-3 documents routed based on what it's trying to do right now.
Multi-Agent Systems
As agent architectures evolve from single agents to multi-agent orchestration, the cost of context becomes multiplicative. If 5 agents each spend 30,000 tokens searching blindly, that's 150,000 tokens per task. If each gets a 1,550-token curated context, that's 7,750 tokens. The economics of structured context scale better than the economics of search.
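The multiplicative arithmetic above can be stated as a one-liner, using the per-query figures from this experiment:

```python
def fleet_context_cost(agents: int, per_agent_context: int) -> int:
    """Total context cost when every agent in a task acquires its own context."""
    return agents * per_agent_context

print(fleet_context_cost(5, 30_000))  # blind search:    150000 tokens
print(fleet_context_cost(5, 1_550))   # curated context:   7750 tokens
```

Any per-agent saving is multiplied by fleet size, which is why context efficiency matters more, not less, as orchestration grows.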
Caveats and Honesty
I want to be transparent about what this experiment is and isn't:
- 20 queries on a single repo. Not a peer-reviewed study. Early data, not final conclusions.
- The curated docs took ~2 hours to create (with AI assistance). That's a real cost. Graphify's graph builds automatically.
- We measured keyword coverage, not task completion. A proper study would have the agent actually fix the bug and grade the result.
- I built Personize -- the SmartContext system that routes these docs. I have a commercial interest in this approach working well.
We're designing a proper 3-arm experiment with human grading, multiple repetitions, and statistical tests for publication. The methodology and raw data from this preliminary test are documented in our experiments folder.
What I'd Recommend
If you're working with AI coding agents on a production codebase:
- Start with 5-8 module docs for your most connected modules. Two hours of writing saves thousands of tokens per session.
- Use Graphify for impact analysis and structural queries -- it genuinely excels there.
- Don't rely on either alone. Curated docs for intent, graphs for structure, raw search when you need the actual code.
- Front-load your docs with domain keywords in the first paragraph. Embedding similarity scores are driven ~80% by the first 1,000 characters.
The goal isn't to replace code search. It's to make the agent read the right 2 files instead of the wrong 10.
References
- Graphify -- Open-Source Knowledge Graph Skill -- The tool that inspired this comparison
- Enterprise AI Knowledge Management 2026: Microsoft's Shift to Governed Agent Workflows
- From RAG to Context -- 2025 Year-End Review -- RAGFlow's analysis of retrieval evolving into context engines
- The Enterprise AI Stack in 2026: Models, Agents, and Infrastructure -- Multi-agent architecture trends
- Enterprise RAG: Building an AI Knowledge Base in 2026 -- Governance and structured retrieval patterns
- A Systematic Framework for Enterprise Knowledge Retrieval -- IEEE CAI 2026, LLM-generated metadata for RAG systems
- Databricks: Memory Scaling for AI Agents -- Production memory architectures
- Potpie: Knowledge Graphs for Enterprise Codebases -- Graph-powered code intelligence at scale