Headroom: open-source token compression for AI agents

June 24, 2026·AI Newsroom·11 min read·12 sources

headroom ai-agents token-optimization context-compression mcp claude-code+12

Headroom open-source repository social preview card — Image: GitHub / headroomlabs-ai/headroom repository (Apache 2.0)

On 2026-06-22, Headroom shipped v0.27.0 — a release that adds headroom update with a release banner, headroom doctor for setup diagnostics, hot-reload of live proxy env knobs (so a proxy started by headroom wrap picks up new settings without a restart), tabular .xlsx/.xls compression, a turnkey Claude Code + Vertex compression path, Cortex Code (Snowflake CoCo) as a supported agent, and a cc-switch reconciler that keeps Headroom in the request path alongside other LLM routers. As of 2026-06-24 the repository had reached 48,803 stars, 3,406 forks, and 368 open issues — six months after its first commit on 2026-01-07 (GitHub API, 2026-06-24; v0.27.0 release, 2026-06-22).

Headroom is an open-source context compression layer for AI coding agents. It compresses tool outputs, logs, RAG chunks, files, and conversation history before they reach the LLM. The published benchmarks in the README and the docs benchmarks page report 60-95% token reduction on real agent workloads, with accuracy preserved or improved on GSM8K, TruthfulQA, SQuAD v2, and BFCL. It is licensed under Apache 2.0, runs entirely locally (data does not leave the machine), and ships in four install modes: library, local proxy, agent wrapper, and MCP server.

What happened

The release. v0.27.0 shipped on 2026-06-22 with 13 feature commits plus fixes. The release notes name the user-facing additions explicitly: headroom update with a release banner, headroom doctor for setup diagnostics, hot-reload of live proxy env knobs (no cold start, no dropped requests, no lost caches), tabular spreadsheet compression, a cc-switch reconciler, and measured token throughput (tokens/sec) surfaced through the proxy. Output-token reduction — verbosity shaper, per-user learning, counterfactual savings — is the headline feature: a new HEADROOM_OUTPUT_SHAPER=1 env knob trims what the model writes back, with headroom learn --verbosity to learn the right terseness from past sessions. Cortex Code (Snowflake CoCo) and Vertex-hosted Claude Code are added to the provider matrix. The Rust compression engine got knob exposure and CCR hardening, and proxy audits landed for traffic visibility.

The repo, in numbers. On 2026-06-24 the GitHub API returns 48,803 stars, 3,406 forks, 368 open issues, primary language Python, license Apache-2.0, created 2026-01-07T19:58:51Z, last push 2026-06-24T03:47:08Z. Thirty tagged releases sit on the releases page, with v0.27.0, v0.26.0 (2026-06-16), and v0.25.0 (2026-06-12) as the three most recent — a roughly biweekly cadence through June 2026. CI runs on GitHub Actions with Codecov; packages ship to PyPI and npm; a Docker image publishes to ghcr.io/chopratejas/headroom:latest. The default branch is main.

The four install modes. From the README and the docs, Headroom installs in any of four shapes:

Library — compress(messages) inline in Python or TypeScript.
Proxy — headroom proxy --port 8787, drop-in, zero code changes, any language or OpenAI-compatible client.
Agent wrap — headroom wrap claude|codex|cursor|aider|copilot|opencode in one command; the wrapper injects the proxy config and launches the agent.
MCP server — headroom mcp install registers headroom_compress, headroom_retrieve, and headroom_stats tools for any MCP client.

The headline installer is pip install "headroom-ai[all]" (Python 3.10+); granular extras split out [proxy], [mcp], [ml], [code], [memory], [relevance], [image], [agno], [langchain], [evals], and [pytorch-mps]. An npm install headroom-ai path is published alongside, and a Docker image is on ghcr.io. The README’s headroom perf and headroom dashboard commands report live savings once the proxy is running.

The four transforms. Inside the pipeline, ContentRouter detects the content type and selects the right compressor:

SmartCrusher — universal JSON: arrays of dicts, nested objects, mixed types. The docs call out that it preserves the first N items (schema), the last N items (recency), all anomalies (errors, warnings), and the statistical distribution of the rest.
CodeCompressor — AST-aware compression for Python, JS, Go, Rust, Java, and C++.
Kompress-base — the project’s own text model, trained on agentic traces, hosted on Hugging Face as chopratejas/kompress-v2-base.
Image compression — 40-90% reduction via a trained ML router, documented separately in the image compression page.

Two adjacent pieces round out the architecture. CacheAligner stabilizes prompt prefixes so Anthropic and OpenAI KV caches actually hit. CCR (Reversible Compression) stores originals locally with a configurable TTL; if the LLM needs an original, it calls headroom_retrieve and gets it back. Cross-agent memory and headroom learn — which mines failed sessions and writes corrections into CLAUDE.md or AGENTS.md — extend the project beyond pure compression.

Why it matters

Three reasons this is worth a builder-reader’s attention on 2026-06-24.

1. Token cost is still the top complaint about agent loops in 2026. Tool outputs, RAG chunks, and conversation history stack up fast; a single multi-tool agent turn can easily push 50-100K input tokens. Headroom’s README publishes concrete per-workload numbers — not just a “60-95%” headline. The published savings table is:

Workload	Before	After	Savings
Code search (100 results)	17,765 tok	1,408 tok	92%
SRE incident debugging	65,694 tok	5,118 tok	92%
GitHub issue triage	54,174 tok	14,761 tok	73%
Codebase exploration	78,502 tok	41,254 tok	47%

The accuracy story is the second half: GSM8K ±0.000, TruthfulQA +0.030, SQuAD v2 97% at 19% compression, BFCL 97% at 32% compression — published in the README proof section and the docs benchmarks page. The README is direct about the zero-compression cases too: grep results and Python source already arrive compact, so SmartCrusher passes them through to preserve correctness.

2. v0.27.0 moves the project from “compress the prompt” to “compress both directions” — and the output side is where the money is on Opus-class models. The README is explicit: “on Opus-class models output costs 5× input.” v0.27.0’s output-token reduction adds two pieces:

Verbosity steering — a short “be terse, don’t restate context” line appended to the end of the system prompt, so the prompt prefix and the provider KV cache still hit.
Effort routing — when a turn is just the model resuming after a tool result (a file read, a passing test), the proxy dials the model’s thinking effort down. New questions and errors keep full effort.

Crucially, the project does not pretend to know what the model would have written. headroom output-savings reports an estimate with a confidence interval by default, labelled estimated; setting HEADROOM_OUTPUT_HOLDOUT=0.1 leaves 10% of conversations unshaped as a control group and labels the dashboard card measured instead. This is the honest way to report a counterfactual number, and the proposal document walks through the measurement methodology.

3. The compatibility matrix is broad enough to drop in without rewriting a stack. From the README compatibility table, headroom wrap works with Claude Code, Codex, Cursor, Aider, Copilot CLI, OpenClaw, OpenCode, and Cortex Code. MCP-native clients install via headroom mcp install. OpenAI-compatible clients work through the proxy. Library integrations exist for the Vercel AI SDK (wrapLanguageModel({ model, middleware: headroomMiddleware() })), Anthropic and OpenAI SDKs (withHeadroom(new Anthropic())), LiteLLM, LangChain, Agno, Strands, and ASGI apps. OpenCode is explicitly listed in the matrix — relevant to this site’s own stack.

The community signal is the third data point. 48,803 stars in six months is unusually fast for a developer-tooling repo. The star history chart shows a clean ramp through Q1 2026 and a continued climb through June. Discord is linked from the README; AGENTS.md and CONTRIBUTING.md ship in the repo; CI is green; Codecov is wired up.

Practical implications

For practitioners, four concrete takeaways.

The fastest path is one command. From the README’s Get started (60 seconds) section:
```
pip install "headroom-ai[all]"
headroom wrap claude
```
headroom wrap injects the proxy config and launches the agent with compression already in the request path. The headroom doctor command added in v0.27.0 surfaces setup problems on the first run. The headroom update command added in the same release ships a release banner, so users see the new version notification without leaving the terminal.
The proxy mode is the right answer for a stack you do not own. headroom proxy --port 8787 exposes a local proxy on port 8787; any OpenAI-compatible client (or any of the supported coding agents) can be pointed at it. v0.27.0’s hot-reload change matters here: a proxy that headroom wrap reused rather than started would not have picked up env knobs you exported after launch, because its environment was snapshotted. v0.27.0’s POST /admin/runtime-env loopback sync means the running proxy reads live settings on every request — no cold start, no dropped requests, no lost caches. On a shared proxy these overrides are global and the last explicit setting wins.
The proxy model is local, not hosted. From the architecture page and the README, the compression runs on the user’s machine. The ML model (Kompress-v2-base) downloads from Hugging Face on first use; thereafter it runs locally via ONNX Runtime (Rust). The CCR cache is local with a configurable TTL. Headroom is a local proxy/library, not a SaaS — there is no Headroom-hosted server between the agent and the LLM provider.
The output-side shaper is opt-in for a reason. HEADROOM_OUTPUT_SHAPER=1 is off by default. The README is explicit that output savings are counterfactual — Headroom never sees what the model would have written, so the default report is an estimate with a 95% confidence interval, not a number. To get a measured number, set HEADROOM_OUTPUT_HOLDOUT=0.1 and compare the shaped and unshaped groups in the dashboard. A builder who wants the savings without the asterisks should configure the holdout up front.

A small, concrete worked example is worth quoting. The README’s headline GIF caption: “Live: 10,144 → 1,260 tokens — same FATAL found.” That is a 100-entry production log with a critical error at position 67, compressed to 1,260 tokens, with the model still answering all four questions (error, error code, resolution, affected count) correctly. The docs page JSON compression (SmartCrusher) walks through the same case with the full before/after table.

Risks and caveats

The brief’s flag is highRiskClaims: false: this is an open-source repository, not a security incident, regulatory matter, or contested claim. The risks are framing-level, and the article must keep the lines clearly drawn.

The 60-95% range is a range, not a guarantee. The published numbers are 47% on the hardest case (codebase exploration) and 92% on the best cases (code search, SRE debugging). A builder who runs a workload dominated by structured JSON arrays of dicts will see the upper end; a builder who runs a workload dominated by already-compact source code or grep output will see 0% (the docs call this out explicitly, with a callout box: “Zero compression is intentional”). The headline is a true range across the published workloads, not a uniform claim.
The accuracy numbers are N=100. GSM8K, TruthfulQA, SQuAD v2, and BFCL are run on samples of 100 each, per the README proof table. That is enough to show no regression and a small improvement on TruthfulQA; it is not enough to support strong claims about general behaviour. A production deployment should run its own A/B before turning on aggressive compression.
Output savings are counterfactual by default. As noted above, headroom output-savings reports an estimate with a 95% confidence interval unless HEADROOM_OUTPUT_HOLDOUT=0.1 is configured. The README is direct about this and includes a holdout path for measurement. The article should not present the output-side savings number as if it were measured.
The ML model downloads from Hugging Face on first use. A corporate environment with strict egress controls may block the download, or may require a pre-mirror of chopratejas/kompress-v2-base. The Rust compression engine adds a Rust toolchain dependency for some install paths. The README is honest about both: “Corporate SSL-inspection environments may need Rust pre-installed or a prebuilt wheel.”
The proxy intercepts all agent-to-LLM traffic. Headroom is a local proxy that sits between the agent and the provider. Users who run multiple agents, who route through other proxies (LiteLLM, custom gateways, the cc-switch reconciler that v0.27.0 explicitly accommodates), or who work in regulated environments should audit the request flow. The v0.27.0 release includes a new traffic-audit feature, which is the right answer; running the feature and inspecting the audit log is the verification step.
Six months old, fast-moving. 48,803 stars in six months is a strong adoption signal, and the biweekly release cadence is healthy, but the project is young. The README’s When to skip section is honest: skip it if you only use a single provider’s native compaction and do not need cross-agent memory, or if you work in a sandboxed environment where local processes cannot run. The article should not overstate maturity.
“Comparable” tools are not the same category. RTK, lean-ctx, and Compresr are all named in the README’s competitive landscape, but they target narrower scopes (CLI outputs, MCP context, hosted text API). The article’s comparison table in the brief lists the differences correctly; the body must keep the category boundaries clear and not frame Headroom as a drop-in replacement for any of them.

What to watch

Five follow-up signals to track over the next quarter.

Adoption beyond the 48K-star core. Whether Claude Code, Codex, Cursor, Aider, and the open-source agent harnesses (smolagents, LangChain, Agno) ship first-party Headroom integration. The v0.27.0 hot-reload path and the cc-switch reconciler both suggest the project is preparing for use alongside other LLM routers; the next step is framework-level integration.
Output-side measurements at scale. Whether the HEADROOM_OUTPUT_HOLDOUT pattern gets adopted broadly enough to publish measured numbers on the output side. The proposal document is the right shape, and a published measured-with-holdout number across a real user base would be the next milestone.
Expansion of the supported agent matrix. v0.27.0 added Cortex Code and Vertex-hosted Claude Code; the next quarter will tell whether the matrix keeps growing (Cline, Continue.dev, Cody) or stabilises. A growing matrix with each agent added cleanly (memory sharing, code-graph integration) is the right shape.
Kompress-v2 model iterations. The text model is the one piece of the pipeline that depends on training-data choices. A v3 release with broader language coverage or improved accuracy on long-context code would shift the headline numbers.
Risk: a category leader emerges from a major agent vendor. Anthropic, OpenAI, Cursor, and JetBrains all ship some form of context compaction today. If a major vendor ships something equivalent to Headroom’s reversible-compression + cross-agent memory combination, the open-source window narrows. The next 6-12 months will tell.

Sources

Live-verified 2026-06-24 against the live repository, the README on main, the v0.27.0 release notes, the docs site, the GitHub API, and the Hugging Face model card.

Primary

Headroom repository on GitHub — 48,803 stars, 3,406 forks, 368 open issues, Apache 2.0, primary language Python, created 2026-01-07, last push 2026-06-24. Verified via the GitHub REST API on 2026-06-24.
Headroom README on main — installation, the four modes, the proof table (92/92/73/47% savings, GSM8K/TruthfulQA/SQuAD v2/BFCL accuracy), the output-token reduction section with HEADROOM_OUTPUT_SHAPER and HEADROOM_OUTPUT_HOLDOUT, the compatibility matrix, the when to skip section.
Headroom v0.27.0 release — published 2026-06-22T05:57:31Z. 13 feature commits plus fixes: headroom doctor, headroom update with release banner, hot-reload of live proxy env knobs via POST /admin/runtime-env, tabular .xlsx/.xls compression, cc-switch reconciler, measured token throughput, output-token reduction (verbosity shaper + per-user learning + counterfactual savings), Cortex Code (Snowflake CoCo), Vertex-hosted Claude Code turnkey, Rust compression knob exposure, CCR hardening, traffic audits.
Headroom docs — Architecture — pipeline diagram and the role of each transform (CacheAligner, ContentRouter, SmartCrusher, CodeCompressor, Kompress-base, IntelligentContext, RollingWindow, CCR).
Headroom docs — Benchmarks — compression performance on Apple M-series CPU at v0.5.18 (66.1% total savings, 5ms total latency), JSON compression with the 10,144 → 1,260 tokens worked example, QA accuracy preservation, SDK latency per scenario, cost-benefit analysis at Claude Sonnet pricing, pipeline step timing.
Headroom docs — CCR (Reversible Compression) — the retrieval-on-demand model and the configurable TTL.
Headroom docs — Output token reduction proposal — measurement methodology, the 95% confidence interval approach, the holdout pattern.
Headroom on PyPI — headroom-ai package with the [all] extra and the granular extras ([proxy], [mcp], [ml], [code], [memory], [relevance], [image], [agno], [langchain], [evals], [pytorch-mps]).
Headroom on npm — headroom-ai package for Node/TypeScript.
Kompress-v2-base model card on Hugging Face — the text model used by the Kompress-base transform.
Headroom GitHub REST API metadata — stars, forks, license, created, last push, default branch.

Secondary

Headroom star history chart — the ramp from January 2026 to June 2026.