Headroom: open-source token compression for AI agents

Image: GitHub / headroomlabs-ai/headroom repository (Apache 2.0)

On 2026-06-22, Headroom shipped v0.27.0 — a release that adds headroom update with a release banner, headroom doctor for setup diagnostics, hot-reload of live proxy env knobs (so a proxy started by headroom wrap picks up new settings without a restart), tabular .xlsx/.xls compression, a turnkey Claude Code + Vertex compression path, Cortex Code (Snowflake CoCo) as a supported agent, and a cc-switch reconciler that keeps Headroom in the request path alongside other LLM routers. As of 2026-06-24 the repository had reached 48,803 stars, 3,406 forks, and 368 open issues — six months after its first commit on 2026-01-07 (GitHub API, 2026-06-24; v0.27.0 release, 2026-06-22).

Headroom is an open-source context compression layer for AI coding agents. It compresses tool outputs, logs, RAG chunks, files, and conversation history before they reach the LLM. The published benchmarks in the README and the docs benchmarks page report 60-95% token reduction on real agent workloads, with accuracy preserved or improved on GSM8K, TruthfulQA, SQuAD v2, and BFCL. It is licensed under Apache 2.0, runs entirely locally (data does not leave the machine), and ships in four install modes: library, local proxy, agent wrapper, and MCP server.

What happened

The release. v0.27.0 shipped on 2026-06-22 with 13 feature commits plus fixes. The release notes name the user-facing additions explicitly: headroom update with a release banner, headroom doctor for setup diagnostics, hot-reload of live proxy env knobs (no cold start, no dropped requests, no lost caches), tabular spreadsheet compression, a cc-switch reconciler, and measured token throughput (tokens/sec) surfaced through the proxy. Output-token reduction — verbosity shaper, per-user learning, counterfactual savings — is the headline feature: a new HEADROOM_OUTPUT_SHAPER=1 env knob trims what the model writes back, with headroom learn --verbosity to learn the right terseness from past sessions. Cortex Code (Snowflake CoCo) and Vertex-hosted Claude Code are added to the provider matrix. The Rust compression engine got knob exposure and CCR hardening, and proxy audits landed for traffic visibility.

The repo, in numbers. On 2026-06-24 the GitHub API returns 48,803 stars, 3,406 forks, 368 open issues, primary language Python, license Apache-2.0, created 2026-01-07T19:58:51Z, last push 2026-06-24T03:47:08Z. Thirty tagged releases sit on the releases page, with v0.27.0 (2026-06-22), v0.26.0 (2026-06-16), and v0.25.0 (2026-06-12) as the three most recent: a release every four to six days through mid-June 2026. CI runs on GitHub Actions with Codecov; packages ship to PyPI and npm; a Docker image publishes to ghcr.io/chopratejas/headroom:latest. The default branch is main.
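The figures above can be re-checked by anyone. The following is a minimal sketch, assuming the repository lives at headroomlabs-ai/headroom (the path in this article's image credit) and using the field names of the public GitHub REST API; the helper names are illustrative. The network call is kept in its own function so the parsing and date arithmetic can be run offline.

```python
import json
from datetime import date
from urllib.request import Request, urlopen

# Assumption: repository path taken from the image credit in this article.
REPO_API_URL = "https://api.github.com/repos/headroomlabs-ai/headroom"


def fetch_repo(url=REPO_API_URL):
    """Fetch repository metadata from the public GitHub REST API (unauthenticated)."""
    request = Request(url, headers={"Accept": "application/vnd.github+json"})
    with urlopen(request, timeout=10) as response:
        return json.load(response)


def summarize_repo(payload):
    """Pick out the fields quoted in the paragraph above."""
    license_info = payload.get("license") or {}
    return {
        "stars": payload["stargazers_count"],
        "forks": payload["forks_count"],
        # GitHub's open_issues_count also counts open pull requests.
        "open_issues": payload["open_issues_count"],
        "language": payload["language"],
        "license": license_info.get("spdx_id"),
        "created_at": payload["created_at"],
        "pushed_at": payload["pushed_at"],
        "default_branch": payload["default_branch"],
    }


def release_gaps_in_days(release_dates):
    """Days between consecutive releases, oldest first."""
    ordered = sorted(release_dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


# Release dates as quoted in the article: v0.25.0, v0.26.0, v0.27.0.
recent_releases = [date(2026, 6, 12), date(2026, 6, 16), date(2026, 6, 22)]
gaps = release_gaps_in_days(recent_releases)
print(gaps)  # [4, 6]
```

Calling summarize_repo(fetch_repo()) returns today's values, which will differ from the 2026-06-24 snapshot quoted here.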

The four install modes. From the README and the docs, Headroom installs in any of four shapes:

The headline installer is pip install "headroom-ai[all]" (Python 3.10+); granular extras split out [proxy], [mcp], [ml], [code], [memory], [relevance], [image], [agno], [langchain], [evals], and [pytorch-mps]. An npm install headroom-ai path is published alongside, and a Docker image is on ghcr.io. The README’s headroom perf and headroom dashboard commands report live savings once the proxy is running.

The four transforms. Inside the pipeline, ContentRouter detects the content type and selects the right compressor:

Two adjacent pieces round out the architecture. CacheAligner stabilizes prompt prefixes so Anthropic and OpenAI KV caches actually hit. CCR (Reversible Compression) stores originals locally with a configurable TTL; if the LLM needs an original, it calls headroom_retrieve and gets it back. Cross-agent memory and headroom learn — which mines failed sessions and writes corrections into CLAUDE.md or AGENTS.md — extend the project beyond pure compression.
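The reversible-compression idea can be illustrated with a small sketch. This is not Headroom's implementation: the ReversibleStore class, the compress_with_handle function, and the truncation used as a stand-in for real compression are all hypothetical. It only shows the pattern described above, where the original is kept locally under a key with a TTL and can be fetched back on request.

```python
import hashlib
import time


class ReversibleStore:
    """Hypothetical in-memory stand-in for a local store of originals with a TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # key -> (expires_at, original_text)

    def put(self, original):
        """Store the original and return a short content-derived key."""
        key = hashlib.sha256(original.encode("utf-8")).hexdigest()[:16]
        self._entries[key] = (self._clock() + self.ttl_seconds, original)
        return key

    def retrieve(self, key):
        """Return the original, or None if the key is unknown or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, original = entry
        if self._clock() >= expires_at:
            del self._entries[key]
            return None
        return original


def compress_with_handle(store, text, keep_chars=80):
    """Placeholder 'compression': truncate, but leave a handle to the original."""
    if len(text) <= keep_chars:
        return text, None
    key = store.put(text)
    stub = f"{text[:keep_chars]}... [truncated; original retrievable with key={key}]"
    return stub, key
```

In this sketch the model would see only the stub; a retrieval tool such as the headroom_retrieve call named above would play the role of store.retrieve(key).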

Why it matters

Three reasons this is worth a builder-reader’s attention on 2026-06-24.

1. Token cost is still the top complaint about agent loops in 2026. Tool outputs, RAG chunks, and conversation history stack up fast; a single multi-tool agent turn can easily push 50-100K input tokens. Headroom’s README publishes concrete per-workload numbers — not just a “60-95%” headline. The published savings table is:

| Workload | Before (tokens) | After (tokens) | Savings |
| --- | --- | --- | --- |
| Code search (100 results) | 17,765 | 1,408 | 92% |
| SRE incident debugging | 65,694 | 5,118 | 92% |
| GitHub issue triage | 54,174 | 14,761 | 73% |
| Codebase exploration | 78,502 | 41,254 | 47% |
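The savings column follows directly from the before and after counts. A quick check, using only the numbers in the table (the pooled total at the end is derived here and is not a figure published by the project):

```python
# (before, after) token counts as published in the README savings table.
workloads = {
    "Code search (100 results)": (17_765, 1_408),
    "SRE incident debugging": (65_694, 5_118),
    "GitHub issue triage": (54_174, 14_761),
    "Codebase exploration": (78_502, 41_254),
}


def savings_percent(before, after):
    """Token reduction as a whole-number percentage."""
    return round(100 * (1 - after / before))


per_workload = {name: savings_percent(b, a) for name, (b, a) in workloads.items()}
for name, pct in per_workload.items():
    print(f"{name}: {pct}%")

# Pooled across all four rows; this weights each workload by its size,
# which may not match any real usage mix.
total_before = sum(before for before, _ in workloads.values())
total_after = sum(after for _, after in workloads.values())
pooled = savings_percent(total_before, total_after)
print(f"Pooled: {pooled}%")
```

The per-row results reproduce the published 92%, 92%, 73%, and 47%.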

The accuracy story is the second half: GSM8K ±0.000, TruthfulQA +0.030, SQuAD v2 97% at 19% compression, BFCL 97% at 32% compression — published in the README proof section and the docs benchmarks page. The README is direct about the zero-compression cases too: grep results and Python source already arrive compact, so SmartCrusher passes them through to preserve correctness.

2. v0.27.0 moves the project from “compress the prompt” to “compress both directions” — and the output side is where the money is on Opus-class models. The README is explicit: “on Opus-class models output costs 5× input.” v0.27.0’s output-token reduction adds two pieces:

Crucially, the project does not pretend to know what the model would have written. headroom output-savings reports an estimate with a confidence interval by default, labelled estimated; setting HEADROOM_OUTPUT_HOLDOUT=0.1 leaves 10% of conversations unshaped as a control group and labels the dashboard card measured instead. This is the honest way to report a counterfactual number, and the proposal document walks through the measurement methodology.
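The holdout idea is a standard control-group design, and a sketch makes the difference between "estimated" and "measured" concrete. The code below is an assumption-laden illustration, not Headroom's code: in_holdout and measured_output_savings are hypothetical names, and hashing the conversation ID is just one common way to get a stable 10% split.

```python
import hashlib
from statistics import mean

# Mirrors the HEADROOM_OUTPUT_HOLDOUT=0.1 setting described above.
HOLDOUT_FRACTION = 0.1


def in_holdout(conversation_id, fraction=HOLDOUT_FRACTION):
    """Deterministically assign a conversation to the unshaped control group."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction


def measured_output_savings(control_tokens, shaped_tokens):
    """Relative reduction in mean output tokens, shaped group vs. control group."""
    return 1 - mean(shaped_tokens) / mean(control_tokens)


# Hypothetical per-conversation output token counts, for illustration only.
control = [1000, 1200]
shaped = [600, 720]
print(f"measured savings: {measured_output_savings(control, shaped):.0%}")
```

Because the assignment depends only on the conversation ID, a conversation stays in the same group across turns, which keeps the comparison clean. Without a control group, the "what would the model have written" side of the comparison has to be modelled, which is why the default number is labelled estimated.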

3. The compatibility matrix is broad enough to drop in without rewriting a stack. From the README compatibility table, headroom wrap works with Claude Code, Codex, Cursor, Aider, Copilot CLI, OpenClaw, OpenCode, and Cortex Code. MCP-native clients install via headroom mcp install. OpenAI-compatible clients work through the proxy. Library integrations exist for the Vercel AI SDK (wrapLanguageModel({ model, middleware: headroomMiddleware() })), Anthropic and OpenAI SDKs (withHeadroom(new Anthropic())), LiteLLM, LangChain, Agno, Strands, and ASGI apps. OpenCode is explicitly listed in the matrix — relevant to this site’s own stack.

The community signal is a further data point. 48,803 stars in under six months is unusually fast for a developer-tooling repo. The star history chart shows a clean ramp through Q1 2026 and a continued climb through June. Discord is linked from the README; AGENTS.md and CONTRIBUTING.md ship in the repo; CI is green; Codecov is wired up.

Practical implications

For practitioners, four concrete takeaways.

A small, concrete worked example is worth quoting. The README’s headline GIF caption: “Live: 10,144 → 1,260 tokens — same FATAL found.” That is a 100-entry production log with a critical error at position 67, compressed to 1,260 tokens, with the model still answering all four questions (error, error code, resolution, affected count) correctly. The docs page JSON compression (SmartCrusher) walks through the same case with the full before/after table.
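To see why a log like this compresses so well without losing the answer, consider a deliberately naive sketch. This is not SmartCrusher's algorithm: build_sample_log and crush_log are hypothetical, the log content is invented, and the rule used (keep severe entries, count the rest) is only one plausible strategy.

```python
import json


def build_sample_log(n=100, fatal_position=67):
    """Invented stand-in for a 100-entry log with one critical error (1-indexed)."""
    entries = []
    for seq in range(1, n + 1):
        if seq == fatal_position:
            entries.append({"seq": seq, "level": "FATAL", "code": "E_EXAMPLE",
                            "msg": "hypothetical critical failure"})
        else:
            entries.append({"seq": seq, "level": "INFO", "msg": "request served"})
    return entries


def crush_log(entries, keep_levels=("WARN", "ERROR", "FATAL")):
    """Keep severe entries verbatim; replace routine ones with a count."""
    kept = [e for e in entries if e["level"] in keep_levels]
    return {
        "kept": kept,
        "omitted_count": len(entries) - len(kept),
        "first": entries[0] if entries else None,
        "last": entries[-1] if entries else None,
    }


log = build_sample_log()
crushed = crush_log(log)
raw_size = len(json.dumps(log))
crushed_size = len(json.dumps(crushed))
print(f"characters: {raw_size} -> {crushed_size}")

# The reduction implied by the README caption quoted above.
caption_savings = round(100 * (1 - 1_260 / 10_144), 1)
print(f"caption: {caption_savings}% fewer tokens")
```

The FATAL entry survives intact because it is the rare item, which is the property the caption's "same FATAL found" points at. The caption's own numbers work out to roughly an 88% reduction.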

Risks and caveats

Headroom is an open-source repository, not a security incident, regulatory matter, or contested claim, so the risks here are framing-level. Each caveat below marks a line that is easy to blur.

  1. The 60-95% range is a range, not a guarantee. The published numbers are 47% on the hardest case (codebase exploration) and 92% on the best cases (code search, SRE debugging). A builder who runs a workload dominated by structured JSON arrays of dicts will see the upper end; a builder who runs a workload dominated by already-compact source code or grep output will see 0% (the docs call this out explicitly, with a callout box: “Zero compression is intentional”). Note that the 47% figure already falls below the headline's 60% floor, so the headline describes typical published workloads, not a bound that every workload meets.

  2. The accuracy numbers are N=100. GSM8K, TruthfulQA, SQuAD v2, and BFCL are run on samples of 100 each, per the README proof table. That is enough to show no regression and a small improvement on TruthfulQA; it is not enough to support strong claims about general behaviour. A production deployment should run its own A/B before turning on aggressive compression.

  3. Output savings are counterfactual by default. As noted above, headroom output-savings reports an estimate with a 95% confidence interval unless HEADROOM_OUTPUT_HOLDOUT=0.1 is configured. The README is direct about this and includes a holdout path for measurement. Without the holdout, the output-side savings number is an estimate and should not be quoted as measured.

  4. The ML model downloads from Hugging Face on first use. A corporate environment with strict egress controls may block the download, or may require a pre-mirror of chopratejas/kompress-v2-base. The Rust compression engine adds a Rust toolchain dependency for some install paths. The README is honest about both: “Corporate SSL-inspection environments may need Rust pre-installed or a prebuilt wheel.”

  5. The proxy intercepts all agent-to-LLM traffic. Headroom is a local proxy that sits between the agent and the provider. Users who run multiple agents, who route through other proxies (LiteLLM, custom gateways, the cc-switch reconciler that v0.27.0 explicitly accommodates), or who work in regulated environments should audit the request flow. The v0.27.0 release includes a new traffic-audit feature, which is the right answer; running the feature and inspecting the audit log is the verification step.

  6. Six months old, fast-moving. 48,803 stars in under six months is a strong adoption signal, and the release cadence (three releases between 2026-06-12 and 2026-06-22) shows active development, but the project is young. The README’s When to skip section is honest: skip it if you only use a single provider’s native compaction and do not need cross-agent memory, or if you work in a sandboxed environment where local processes cannot run. Adoption speed is not the same as maturity.

  7. “Comparable” tools are not the same category. RTK, lean-ctx, and Compresr are all named in the README’s competitive landscape, but they target narrower scopes (CLI outputs, MCP context, hosted text API). The category boundaries matter: Headroom should not be framed as a drop-in replacement for any of them.
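Caveat 2 above can be put in numbers. With only 100 samples, even a high score carries a wide uncertainty band. The sketch below computes a 95% Wilson score interval for a hypothetical result of 97 correct out of 100; it assumes the benchmark samples are independent, and the choice of the Wilson method is ours, not the project's.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical: 97 correct answers out of N=100 samples.
low, high = wilson_interval(97, 100)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

The interval runs from about 0.92 to about 0.99, so a difference of a few points between compressed and uncompressed runs at N=100 is within noise. That is the practical reason to run an A/B on production traffic before enabling aggressive compression.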

What to watch

Five follow-up signals to track over the next quarter.

  1. Adoption beyond the 48K-star core. Whether Claude Code, Codex, Cursor, Aider, and the open-source agent harnesses (smolagents, LangChain, Agno) ship first-party Headroom integration. The v0.27.0 hot-reload path and the cc-switch reconciler both suggest the project is preparing for use alongside other LLM routers; the next step is framework-level integration.
  2. Output-side measurements at scale. Whether the HEADROOM_OUTPUT_HOLDOUT pattern gets adopted broadly enough to publish measured numbers on the output side. The proposal document is the right shape, and a published measured-with-holdout number across a real user base would be the next milestone.
  3. Expansion of the supported agent matrix. v0.27.0 added Cortex Code and Vertex-hosted Claude Code; the next quarter will tell whether the matrix keeps growing (Cline, Continue.dev, Cody) or stabilises. A growing matrix with each agent added cleanly (memory sharing, code-graph integration) is the right shape.
  4. Kompress-v2 model iterations. The text model is the one piece of the pipeline that depends on training-data choices. A v3 release with broader language coverage or improved accuracy on long-context code would shift the headline numbers.
  5. Risk: a category leader emerges from a major agent vendor. Anthropic, OpenAI, Cursor, and JetBrains all ship some form of context compaction today. If a major vendor ships something equivalent to Headroom’s reversible-compression + cross-agent memory combination, the open-source window narrows. The next 6-12 months will tell.

Sources

Verified on 2026-06-24 against the live repository, the README on main, the v0.27.0 release notes, the docs site, the GitHub API, and the Hugging Face model card.
