Headroom, and the context window as a budget

Headroom compresses what an agent reads (tool outputs, logs, RAG chunks, files, conversation history) before any of it reaches the model. The author reports 60 to 95 percent fewer tokens on real workloads: a code search going from 17,765 to 1,408 tokens and an SRE debugging trace from 65,694 to 5,118, roughly a 92 percent cut in each case. Those numbers are self-reported, so read them as a direction, not gospel. What makes it more than a trick is that the compression is reversible: originals are cached locally, and the model can pull the full version back if it actually needs it.
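To make "reversible" concrete, here is a minimal sketch of the idea in Python. It is not Headroom's API or its algorithm: the class name `ReversibleStore`, the `compress` and `expand` methods, and the head-and-tail truncation rule are all assumptions I made up for illustration. The only thing it shares with the real tool is the shape: shorten what the model sees, keep the original locally, and leave a handle the model can use to ask for it.

```python
import hashlib


class ReversibleStore:
    """Hypothetical local cache for originals; not Headroom's real interface."""

    def __init__(self):
        self._originals = {}  # handle -> full original text

    def compress(self, text, keep_lines=3):
        """Keep the first and last few lines, replace the middle with a handle."""
        lines = text.splitlines()
        if len(lines) <= 2 * keep_lines:
            return text  # already short, nothing to drop

        # A content hash doubles as the retrieval handle.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self._originals[key] = text

        omitted = len(lines) - 2 * keep_lines
        marker = f"[{omitted} lines omitted; call expand('{key}') for the full text]"
        return "\n".join(lines[:keep_lines] + [marker] + lines[-keep_lines:])

    def expand(self, key):
        """Return the cached original for a handle left by compress()."""
        return self._originals[key]


# Usage: a fake 100-line log shrinks to 7 lines, and the original is recoverable.
store = ReversibleStore()
log = "\n".join(f"line {i}" for i in range(100))
short = store.compress(log)
print(short)
```

The weak spot is visible even in a toy like this: the model only gets the original back if it decides the marker is worth following.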

The number isn't the point. The point is that a tool like this exists at all, and that it is starting to show up as a library, a proxy, and an MCP server. That tells me we have stopped treating the context window as a bucket to fill and started treating it as a budget to spend. It is the practical face of an argument I keep making: output quality saturates on a few well-chosen signals, and most of what we stuff into context makes the answer worse, not better.

The open question is where reversible compression quietly costs you an answer: the case where the model doesn't know it should have asked for the original. Where have you seen lossy context actually change the output?