caveman: Julius Brussee's terse-output skill

June 20, 2026·AI Newsroom·15 min read·14 sources


caveman open-source repository social preview card — Image: GitHub / JuliusBrussee/caveman repository (MIT)

On 2026-06-20, the GitHub repository JuliusBrussee/caveman had reached 74,940 stars, 4,230 forks, 299 open issues, 201 commits, and 15 releases under an MIT license; the latest release is v1.9.0, “Rock pinned. Rock verified. opencode rock work now.”, dated 2026-06-12 (repository metadata, 2026-06-20; README, 2026-06-20; releases, 2026-06-20). caveman is a TypeScript “skill” (a plain Markdown ruleset plus an installer) for 30+ agent platforms, with Claude Code / Codex / Gemini / Cursor / Windsurf / Cline / Copilot / OpenClaw / opencode as the headline targets. The repo was created on 2026-04-04, so on 2026-06-20 it is roughly 11 weeks old, not the “two weeks” sometimes cited in third-party coverage (repository metadata, 2026-06-20). The skill tells the agent to drop filler, use fragments, and keep code / URLs / paths byte-exact. The README ships a benchmark table of 10 real Claude API prompts (average 1,214 → 294 output tokens, reported as a 65% reduction, range 22–87%), a caveman-compress sub-skill with five real memory-file receipts (average 898 → 481 = 46% reduction), a three-arm eval harness (baseline / terse / skill) in evals/, raw data and a reproduction script in benchmarks/, and an > [!IMPORTANT] block that this article leads with: “Caveman only affects output tokens — thinking/reasoning tokens untouched. Caveman no make brain smaller. Caveman make mouth smaller. Biggest win is readability and speed, cost savings a bonus.” (README, 2026-06-20; benchmarks/, 2026-06-20; evals/, 2026-06-20).
The under-reported angle is the ecosystem, not the skill. caveman is one of five sibling repos built on the same philosophy; the other four are caveman-code (whole terminal coding agent), cavemem (cross-agent memory), cavekit (spec-driven build loop), and cavegemma (Gemma 4 31B fine-tune with the caveman style welded into the weights). A separate JuliusBrussee/skills pack adds four more skills (grill-me, interface-kit, junior-to-senior, loop-factory) installed with one command (Caveman Ecosystem, 2026-06-20; caveman-code, 2026-06-20; cavemem, 2026-06-20; cavekit, 2026-06-20; cavegemma, 2026-06-20; skills, 2026-06-20). The load-bearing caveat sits in the README’s own Important box and is quoted verbatim above; the article preserves it in the lede, in the Risks and caveats section, and in the Verdict.

What it is

The skill. The caveman skill is a Markdown ruleset plus a small installer. The README’s “How It Work” section describes the mechanism in three lines: “Install drop skill file in agent. Skill tell agent: drop filler, keep substance, use fragments.” (README — How It Work, 2026-06-20). For Claude Code / Codex / Gemini, the install drops a hook that writes a tiny flag file each session so the agent talks caveman from message one without the user having to remember /caveman. The skill does not change model selection, MCP wiring, or tool-calling semantics — it is a writing-style layer.

Four install paths. The README documents four primary install paths (README — Install, 2026-06-20):

macOS / Linux / WSL / Git Bash: curl -fsSL https://raw.githubusercontent.com/JuliusBrussee/caveman/main/install.sh | bash
Windows (PowerShell 5.1+): irm https://raw.githubusercontent.com/JuliusBrussee/caveman/main/install.ps1 | iex
Universal AGENTS.md / IDE rule files: npx skills add JuliusBrussee/caveman
OpenClaw scope: curl -fsSL https://raw.githubusercontent.com/JuliusBrussee/caveman/main/install.sh | bash -s -- --only openclaw (skill drop at ~/.openclaw/workspace/skills/caveman/SKILL.md plus a marker-fenced SOUL.md block so OpenClaw injects the brevity rule every turn under “Project Context”)

The install takes ~30 seconds, requires Node ≥ 18, detects which AI coding agents are on the host and runs each one’s native install (plugin / extension / skill / rule file), skips what isn’t there, and is safe to re-run (README — Install, 2026-06-20). The v1.9.0 release tightened the security model: the curl|bash one-liner now downloads hook files from the immutable v1.9.0 tag and verifies every hook against src/hooks/checksums.sha256 (SHA-256) before executing, with a mismatch aborting the install (v1.9.0 release notes, 2026-06-12).
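The checksum step described above can be illustrated with a minimal Python sketch. It is not the installer's code: the manifest layout is assumed to follow the common `sha256sum` format (hex digest, whitespace, relative path), and the hook file name is hypothetical. The point is only the abort-on-mismatch idea, where any file whose digest differs from the manifest is reported before anything runs.

```python
import hashlib
import tempfile
from pathlib import Path


def parse_manifest(text: str) -> dict[str, str]:
    """Parse sha256sum-style lines: '<hex digest>  <relative path>' (assumed format)."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, _, name = line.partition(" ")
        expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify_hooks(root: Path, manifest_text: str) -> list[str]:
    """Return names of files that are missing or whose SHA-256 does not match."""
    bad = []
    for name, digest in parse_manifest(manifest_text).items():
        path = root / name
        if not path.is_file() or hashlib.sha256(path.read_bytes()).hexdigest() != digest:
            bad.append(name)
    return bad


# Self-contained demo in a throwaway directory with a hypothetical hook file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    hook = root / "example-hook.js"  # hypothetical name, not a real caveman file
    hook.write_bytes(b"console.log('hook');\n")
    manifest = f"{hashlib.sha256(hook.read_bytes()).hexdigest()}  example-hook.js\n"

    clean = verify_hooks(root, manifest)     # file untouched: nothing to report
    hook.write_bytes(b"console.log('tampered');\n")
    tampered = verify_hooks(root, manifest)  # digest no longer matches

print("clean run:", clean)
print("after tampering:", tampered)
```

An installer built this way would refuse to execute anything when the returned list is non-empty, which matches the "mismatch aborts the install" behaviour the release notes describe.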

Triggers and levels. The five primary triggers are: /caveman [lite|full|ultra|wenyan] (compress every reply, with the level sticking until session end), /caveman-commit (conventional commit messages, ≤50 char subject), /caveman-review (one-line PR comments), /caveman-stats (real session token usage + lifetime savings + USD, with a tweetable --share line), and /caveman-compress <file> (rewrite a memory file like CLAUDE.md into caveman-speak while preserving code, URLs, and paths byte-for-byte) (README — What You Get, 2026-06-20). The four levels are lite (drop filler), full (default caveman), ultra (telegraphic), and wenyan (classical Chinese, even shorter). Natural-language triggers are also wired in: “talk like caveman”, “be brief”, “be terse”, and “less tokens” all activate caveman mode; “normal mode” or “stop caveman” deactivates (v1.6.0 release notes, 2026-04-15).
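The activation rules in the paragraph above amount to a small piece of session state. The sketch below is purely illustrative: the skill is a Markdown ruleset that the agent itself interprets, so no function like `next_level` exists in the repo. Falling back to `full` for a bare or unrecognised `/caveman` argument is an assumption based on the README calling `full` the default.

```python
LEVELS = ("lite", "full", "ultra", "wenyan")
ON_PHRASES = ("talk like caveman", "be brief", "be terse", "less tokens")
OFF_PHRASES = ("normal mode", "stop caveman")


def next_level(current, message):
    """Return the caveman level after `message`; None means caveman is off."""
    text = message.strip().lower()
    parts = text.split()
    if parts and parts[0] == "/caveman":
        # Assumption: a bare or unknown argument falls back to the default level.
        if len(parts) > 1 and parts[1] in LEVELS:
            return parts[1]
        return "full"
    if any(phrase in text for phrase in OFF_PHRASES):
        return None
    if any(phrase in text for phrase in ON_PHRASES):
        return current or "full"
    return current  # no trigger: the level sticks for the rest of the session


level = None
history = []
for msg in ["/caveman ultra", "fix the auth bug", "normal mode", "please be terse"]:
    level = next_level(level, msg)
    history.append(level)

print(history)  # ['ultra', 'ultra', None, 'full']
```

Other slash commands such as `/caveman-commit` leave the level untouched here because only the exact `/caveman` token switches modes.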

Auto-activate and statusline. Auto-activation is built in for Claude Code, Codex, and Gemini. Cursor / Windsurf / Cline / Copilot / Continue / Kilo / Roo / Augment / Aider / Amp / Bob / Crush / Devin / Droid / ForgeCode / Goose / iFlow / Junie / Kiro / Mistral Vibe / OpenHands / opencode / Qwen Code / Qoder / Rovo Dev / Tabnine / Trae / Warp / Replit / Antigravity all get a per-agent native install via the smart installer, with always-on rule files written by --with-init for the IDEs that need a rule file instead of a hook (v1.7.0 release notes, 2026-05-01; README — Install, 2026-06-20). The Claude Code statusline shows [CAVEMAN] ⛏ 12.4k (lifetime tokens saved) and updates on every /caveman-stats run; silence with CAVEMAN_STATUSLINE_SAVINGS=0 (README — What You Get, 2026-06-20).

Language preservation. “Speak your tongue. Caveman keep your language. You write Portuguese, caveman grunt Portuguese. Spanish, French, same. Compress the style, not the language. Code, command, error string stay exact.” (README, 2026-06-20). The rule is enforced by the skill — code symbols, error strings, and URLs are protected, and the v1.7.0 release added Typst + LaTeX to the protected-content list so math and markup blocks pass through untouched (v1.7.0 release notes, 2026-05-01).
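One way to spot-check the "code and URLs stay exact" rule on your own transcripts is to extract the protected spans from the original text and confirm each one appears verbatim in the rewrite. The sketch below is an assumption-laden illustration, not the skill's enforcement mechanism: the two regular expressions are simplistic stand-ins that cover only fenced code blocks and bare http(s) URLs, and the sample strings are invented.

```python
import re

TICKS = "`" * 3  # built at runtime so the example holds no literal fence line
FENCE = re.compile(re.escape(TICKS) + r".*?" + re.escape(TICKS), re.DOTALL)
URL = re.compile(r"https?://[^\s)>\]]+")


def protected_spans(text: str) -> list[str]:
    """Collect spans a rewrite is expected to carry over byte-for-byte."""
    fences = FENCE.findall(text)
    prose = FENCE.sub(" ", text)  # do not double-count URLs inside fences
    urls = [u.rstrip(".,;:") for u in URL.findall(prose)]
    return fences + urls


def missing_after_rewrite(original: str, rewritten: str) -> list[str]:
    """Return protected spans from `original` that are not verbatim in `rewritten`."""
    return [span for span in protected_spans(original) if span not in rewritten]


original = (
    "You should basically just run the tests first:\n"
    f"{TICKS}sh\nnpm test\n{TICKS}\n"
    "Then read https://example.com/guide for details."
)
good = f"Run tests:\n{TICKS}sh\nnpm test\n{TICKS}\nRead https://example.com/guide."
bad = f"Run tests:\n{TICKS}sh\nnpm  test\n{TICKS}\nRead example.com/guide."

print(len(missing_after_rewrite(original, good)))  # 0
print(len(missing_after_rewrite(original, bad)))   # 2
```

The second rewrite fails on both counts: the extra space inside the code block and the dropped URL scheme each break byte-exactness even though the text still reads the same to a human.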

Why it matters

The viral signal is real, not vanity. 74,940 stars in ~11 weeks (4,230 forks, 201 commits, 15 releases) is consistent with a cultural-moment release rather than a vanity spike (repository metadata, 2026-06-20). The value proposition is trivial to explain (“agent talk less, pay less”) and trivial to try (one curl line, ~30 seconds, idempotent), and the savings are visible in the developer’s terminal on the first reply via the statusline badge, a feedback loop that drives word-of-mouth. The 15 release tags between 2026-04-04 and 2026-06-12 work out to an average of roughly one release every four to five days (15 tags in 69 days), though the spacing is uneven: v1.7.0 (2026-05-01), v1.8.0 + v1.8.1 + v1.8.2 (2026-05-10 to 2026-05-12), and v1.9.0 (2026-06-12) are the major feature drops in the eight weeks before publication (releases, 2026-06-20).

The benchmark table is unusually honest for this kind of project. Most “save tokens” claims are vendor self-reports with no raw data and no eval harness. caveman publishes (README — Benchmarks, 2026-06-20):

The full 10-prompt table with raw counts: average 1,214 → 294 output tokens, reported as a 65% reduction, range 22–87%. Selected rows from the published table: Explain React re-render bug 1,180 → 159 (87%); Fix auth middleware token expiry 704 → 121 (83%); Refactor callback to async/await 387 → 301 (22%); Implement React error boundary 3,454 → 456 (87%); Debug PostgreSQL race condition 1,200 → 232 (81%).
The caveman-compress memory-file receipts (5 real files, byte-preserved code/URLs/paths): claude-md-preferences.md 706 → 285 (59.6%), project-notes.md 1,145 → 535 (53.3%), claude-md-project.md 1,122 → 636 (43.3%), todo-list.md 627 → 388 (38.1%), mixed-with-code.md 888 → 560 (36.9%), average 898 → 481 = 46%.
Raw data and a reproduction script in benchmarks/.
A three-arm eval harness in evals/ — baseline (the verbose default) / terse (a generic “Answer concisely” prompt) / skill (caveman) — so an independent reproducer can compare caveman against a strong generic baseline, not against the verbose default.
A reference to a March 2026 paper, “Brevity Constraints Reverse Performance Hierarchies in Language Models” (arXiv 2604.00025, 2026-04-02 last-modified), that found constraining large models to brief responses improved accuracy by 26 points on certain benchmarks.

This is a stronger source base than most “agent productivity” claims get — and the README is unusually direct about the methodology caveat: caveman is compared against the terse arm, not the baseline, “so the delta is honest” (README, 2026-06-20).
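The percentages in the list above can be re-derived from the quoted token counts. The short Python check below uses only figures quoted in this article. It also separates two ways of averaging: the mean of per-row reductions and the reduction of the totals. Reading the README's 65% as a mean of per-prompt percentages is an assumption, and it cannot be confirmed here because only five of the ten rows are quoted.

```python
# (before, after) token counts quoted in this article.
prompt_rows = {
    "Explain React re-render bug": (1180, 159),
    "Fix auth middleware token expiry": (704, 121),
    "Refactor callback to async/await": (387, 301),
    "Implement React error boundary": (3454, 456),
    "Debug PostgreSQL race condition": (1200, 232),
}
memory_rows = {
    "claude-md-preferences.md": (706, 285),
    "project-notes.md": (1145, 535),
    "claude-md-project.md": (1122, 636),
    "todo-list.md": (627, 388),
    "mixed-with-code.md": (888, 560),
}


def reduction(before: float, after: float) -> float:
    """Percentage of tokens removed."""
    return 100.0 * (1 - after / before)


def mean_of_reductions(rows: dict) -> float:
    """Average the per-row percentages (every row weighs the same)."""
    return sum(reduction(b, a) for b, a in rows.values()) / len(rows)


def reduction_of_totals(rows: dict) -> float:
    """Reduction computed on summed tokens (long rows weigh more)."""
    before = sum(b for b, _ in rows.values())
    after = sum(a for _, a in rows.values())
    return reduction(before, after)


for name, (b, a) in prompt_rows.items():
    print(f"{name}: {reduction(b, a):.0f}%")

print(f"memory files, mean of per-file reductions: {mean_of_reductions(memory_rows):.1f}%")
print(f"memory files, reduction of totals: {reduction_of_totals(memory_rows):.1f}%")
print(f"prompt averages 1,214 -> 294 as a ratio: {reduction(1214, 294):.1f}%")
```

For the memory files both methods land at about 46%. For the prompt benchmark, the ratio of the two published averages comes out near 76%, not 65%, so the headline figure is better quoted as "65% average reduction per prompt" than as something derived from the 1,214 → 294 pair.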

The ecosystem is the under-reported story. caveman is not a single skill. The README lists five repos in the same philosophy (README — Caveman Ecosystem, 2026-06-20):

Repo	What it does	Install	Stars / Forks / Issues (2026-06-20)
caveman	Output compression skill (this repo)	`curl ... | bash` or `npx skills add`	74,940 / 4,230 / 299
caveman-code	Whole terminal coding agent — 4-layer token compression (model reply, tool output budgets, read dedup, optional RTK bash proxy)	`npm install -g @juliusbrussee/caveman-code`	548 / 60 / 26
cavemem	Cross-agent persistent memory, hybrid BM25 + local vectors, two native tools (`memory_search`, `memory_save`)	(peer of caveman)	551 / 48 / 37
cavekit	Spec-driven build loop — natural language to blueprints to parallel build plans to working software, with cross-model peer review	(peer of caveman)	1,044 / 75 / 14
cavegemma	Gemma 4 31B LoRA fine-tune (QLoRA NF4, rank 16, α 32, 3 epochs) on 1,750 train + 193 eval pairs; caveman-style welded into weights	HF: `JBrussee/gemma-4-31B-caveman` (62.5 GB merged) or `JBrussee/gemma-4-31B-caveman-lora` (534 MB)	56 / 10 / 0

The README’s “Caveman Ecosystem” section states the composition in one paragraph: “Compose: cavekit drive build, caveman compress what agent say, cavemem compress what agent remember, cavegemma bake compression into weight, caveman-code ship it all as one terminal agent. One rock. Two rock. Three rock. Four rock. Five rock. That it.” (README — Caveman Ecosystem, 2026-06-20). The sibling skills pack JuliusBrussee/skills (49 stars, 3 forks, MIT, 0 open issues) ships four more skills with one install: grill-me (agent grills your plan before you build the wrong thing), interface-kit (UI guidance), junior-to-senior (adversarial review pass), and loop-factory (spec-driven task loop). Install is npx skills@latest add JuliusBrussee/skills (skills, 2026-06-20).

What to watch

Seven follow-up signals, in order of how informative each one is about the project’s direction.

An independent third-party reproduction of the 10-prompt benchmark. The eval harness is in evals/, the raw data and reproduction script are in benchmarks/, and the methodology is documented (README — Benchmarks, 2026-06-20). As of 2026-06-20 no independent reproduction has been published. The next-cycle story is the first third-party re-run.
Thinking-token cost behavior on reasoning models. caveman only affects output tokens, and reasoning models (Claude with extended thinking, OpenAI o-series, Gemini Thinking) bill output + thinking together. The 65% output reduction may be partially offset by unchanged or larger thinking-token costs. The next-cycle story is the bill-line on a 50-turn reasoning-model session with caveman enabled.
caveman-code adoption and the published MicroBench. The README’s lede claim — “~2× fewer tokens than Codex on identical tasks” — is now backed by a published 25-task MicroBench dated 2026-05-18 (caveman-code README, 2026-06-20): 524k fresh tokens (caveman) vs 1,010k (Codex CLI), pass rate 14/25 vs 15/25 (one-task delta), raw CSV at research/results/honest-bench-2026-05-18.csv, aggregate JSON, methodology in research/README.md, task prompts in research/evals/microbench/tasks/, reproduction command npx tsx research/evals/run-honest-bench.ts --tools caveman,codex. The benchmark is still project self-report, and the pass-rate delta is “within one task,” not a clean win — but the raw data is published and the harness is one command away from re-running. The next-cycle story is the first third-party re-run on a non-gpt-5.5 model.
cavegemma real-world accuracy on non-coding tasks. The published 193-pair holdout (cavegemma README, 2026-06-20) shows compression 0.59–0.92 (vs gold caveman 0.3–0.5), code-fence exactness 96–100%, semantic similarity 91–98%, 81.5% eval accuracy, and the project is honest that “Compression weaker than gold pairs — model lands 0.6-0.9, gold sits 0.3-0.5. Filter accepted ≤1.0× source; tighten to ≤0.7 next run, push harder.” The next-cycle story is the second fine-tune run on a tighter filter, plus a holdout on a non-coding task family.
Broader-corpus caveman-compress evaluation. The 46% memory-file average is from 5 files in the published receipts; the floor is 36.9% (mixed-with-code.md) and the ceiling is 59.6% (claude-md-preferences.md). The next-cycle story is the median across a few hundred real CLAUDE.md / AGENTS.md / README files.
The OpenClaw skill-marketplace dynamics. The --only openclaw install drops a marker-fenced  ...  block into ~/.openclaw/workspace/SOUL.md so the brevity is auto-injected under “Project Context” on every turn, no per-session /caveman required (README — Lobster, Meet Rock, 2026-06-20). The next-cycle story is how OpenClaw’s skill marketplace treats a “one-line SOUL.md mutation” install versus a per-session /skill invocation.
The cavecrew-* subagent family on real production tasks. The “~60% fewer tokens than vanilla” claim is in the README’s What You Get table for the three subagents (cavecrew-investigator / cavecrew-builder / cavecrew-reviewer) and is the next-cycle under-reported story — the claim is one sentence, no published benchmark yet (README — What You Get, 2026-06-20).
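For the caveman-code MicroBench figures cited in the list above, the arithmetic is simple enough to check by hand. The sketch below reproduces the 1.93× token ratio from the rounded totals quoted in this article. It also computes tokens per passed task, which is not a metric the caveman-code README reports; it is included only as one possible way to fold the one-task pass-rate gap into the comparison.

```python
# Rounded totals as quoted in this article (25-task MicroBench, 2026-05-18).
caveman = {"fresh_tokens": 524_000, "passed": 14}
codex = {"fresh_tokens": 1_010_000, "passed": 15}

# Headline figure: how many times more tokens Codex CLI used overall.
token_ratio = codex["fresh_tokens"] / caveman["fresh_tokens"]

# Illustrative only: tokens spent per task that actually passed.
per_pass_caveman = caveman["fresh_tokens"] / caveman["passed"]
per_pass_codex = codex["fresh_tokens"] / codex["passed"]
per_pass_ratio = per_pass_codex / per_pass_caveman

print(f"token ratio: {token_ratio:.2f}x")
print(f"tokens per passed task: {per_pass_caveman:,.0f} vs {per_pass_codex:,.0f}")
print(f"per-pass ratio: {per_pass_ratio:.2f}x")
```

Normalising by passed tasks narrows the gap slightly (to roughly 1.8×) because Codex CLI passed one more task, which is why the pass-rate caveat belongs next to the headline ratio.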

Risks and caveats

The README opens with an > [!IMPORTANT] block and ships seven load-bearing caveats in the body. The article preserves all seven.

Thinking tokens are not reduced. “Caveman only affects output tokens — thinking/reasoning tokens untouched. Caveman no make brain smaller. Caveman make mouth smaller. Biggest win is readability and speed, cost savings a bonus.” (README — Important box, 2026-06-20). For reasoning models (Claude with extended thinking, OpenAI o-series, Gemini Thinking) the billable token count is output + thinking. The 65% / 22–87% numbers are output-token reductions only — the article does not present caveman as a blanket “65% off your bill.”
The 65% average hides a 22% floor. The Refactor callback to async/await task is 387 → 301 (22%) and the Architecture: microservices vs monolith task is 446 → 310 (30%) (README — Benchmarks, 2026-06-20). The README publishes the full table and the article quotes selected rows verbatim. The right summary is “65% on average, with a 22% floor and an 87% ceiling.”
The benchmark is project self-report. The 10-prompt set, the three-arm eval harness, and the reproduction script are all in the caveman repo — no independent third-party reproduction has been published as of 2026-06-20. The article reports the numbers with that caveat attached, and the What to watch section lists the third-party reproduction as signal #1.
The “1.93× fewer than Codex” claim is now backed by a published MicroBench, but is still self-report. The caveman-code README publishes a 25-task MicroBench dated 2026-05-18 with raw CSV, aggregate JSON, methodology, task prompts, and a one-line reproduction command (caveman-code README, 2026-06-20). The numbers are 524k fresh tokens (caveman) vs 1,010k (Codex CLI) = 1.93×, with pass rate 14/25 vs 15/25 (one-task delta). The article does not present this as a verified result; it presents the published harness and the “within one task” pass-rate framing that the README itself uses.
cavegemma has a published eval, not an external benchmark. The 193-pair holdout (cavegemma README, 2026-06-20) reports 81.5% accuracy, code-fence exactness 96–100%, semantic similarity 91–98%, and the project is direct that the compression is weaker than gold (0.6–0.9 vs gold 0.3–0.5) — “Filter accepted ≤1.0× source; tighten to ≤0.7 next run, push harder.” Treat as the maintainer’s own eval until a second fine-tune run ships or an independent holdout lands.
v1.9.0 is “Rock pinned” not “Rock perfect.” The skill is one Markdown file plus an installer; the v1.9.0 release notes ship the SHA-256 manifest and the immutable-tag install for the first time (v1.9.0 release notes, 2026-06-12), but there is no long-term compatibility contract. Pin to a specific release tag for production use, and verify the install in a clean sandbox before team rollout — the script touches ~/.claude/, ~/.codex/, ~/.gemini/, ~/.cursor/, and the OpenClaw workspace path.
The 299 open issues (130 issues + 169 PRs) are a velocity signal, not a stability signal. 299 open on 201 commits is a high issue/PR-to-commit ratio, consistent with a viral project where support load outpaces maintainer capacity (repository metadata, 2026-06-20). The article does not describe caveman as “battle-tested” or “production-grade” without a deployment caveat.

Additional wording cautions preserved:

No “caveman makes your agent faster.” It makes the output shorter; the time to first reply may decrease because the model emits fewer tokens, but that is downstream of output length, not a separate optimization. The README’s framing: “Biggest win is readability and speed, cost savings a bonus.”
No “caveman works with every agent.” It works with 30+ agents — the README’s specific list (Claude Code, Codex, Gemini, Cursor, Windsurf, Cline, Copilot, OpenClaw, opencode, Aider, Amp, Goose, Junie, Warp, Tabnine, Replit, and others) is the load-bearing claim.
No “compression algorithm.” It is a writing-style skill that asks the agent to use fragments. The token reduction is a downstream effect of style, not a separate compression step. README: “Install drop skill file in agent. Skill tell agent: drop filler, keep substance, use fragments.”
No copying of source prose. Only three short blocks are quoted verbatim — the Important box, the React re-render before/after, and the published 10-prompt table rows. Everything else is paraphrased.
No “99% fewer tokens” from a different repo. The “99.2% fewer tokens” number sometimes cited alongside caveman in third-party coverage is from DeusData/codebase-memory-mcp (5 structural queries ~3,400 tokens via codebase-memory-mcp versus ~412,000 via file-by-file grep). caveman’s own published savings are 65% (output compression) and 46% (memory-file compression). The cavegemma eval’s 96–100% code-fence exactness is a different number again (it’s the fraction of source code fences appearing byte-exact in the model’s output, not a token reduction). The three numbers are from three different repos and measure three different things; the article does not conflate.

Practical advice for builders

For Claude Code / Codex / Gemini / Cursor users spending real money on API bills. Install the one-liner (curl -fsSL https://raw.githubusercontent.com/JuliusBrussee/caveman/main/install.sh | bash, ~30 seconds, Node ≥ 18, safe to re-run) and trigger with /caveman [lite|full|ultra|wenyan] per session (README — Install, 2026-06-20). The auto-activate for Claude Code / Codex / Gemini (built-in) means most users do not need to remember the trigger. Set CAVEMAN_STATUSLINE_SAVINGS=1 to see lifetime savings in the Claude Code statusline. The honest first check is /caveman-stats — the --share output is the basis for any “should I keep this” decision. For a team rollout, pin to v1.9.0 (the first release with the immutable-tag install and SHA-256 manifest) rather than main.

For heavy agent users wiring caveman into a multi-agent or OpenClaw setup. Use the --with-init flag for Cursor / Windsurf / Cline / Copilot (the always-on rule file). For OpenClaw, the installer drops a marker-fenced  ...  block into ~/.openclaw/workspace/SOUL.md so the brevity is auto-injected every turn under “Project Context” — no per-session /caveman required (README — Lobster, Meet Rock, 2026-06-20). To remove cleanly, run the same one-liner with --uninstall. The cavecrew-* subagents are worth a separate look for any team that runs a multi-agent workflow (investigator / builder / reviewer roles). For a heavy per-tool-output workload, the caveman-code four-layer compression (model reply, tool output budgets, read dedup, optional RTK bash proxy) reports a −86% aggregate on 10 real tool-output fixtures and a +1.13M-token (~$6.92, Sonnet) net saving on a 30-turn session in the published bench (caveman-code README, 2026-06-20).

For operators / IT admins evaluating the skill for a team toolbox. The install is a single idempotent command, and the skill affects only output style: it does not change model selection, MCP wiring, or tool-calling semantics. Pin to v1.9.0 rather than main; it is the first release that downloads hooks from the immutable tag and verifies each one against src/hooks/checksums.sha256 before executing (v1.9.0 release notes, 2026-06-12). The honest framing is “writing-style skill, not a model swap.” The 299 open items (130 issues + 169 PRs) against 201 commits are a velocity signal, not a stability signal: track them, do not block on them (repository metadata, 2026-06-20). Verify the install command in a clean sandbox before rolling out to the team, because the script touches ~/.claude/, ~/.codex/, ~/.gemini/, ~/.cursor/, and the OpenClaw workspace path. For a team using a reasoning model, set the billable-token expectation to “output tokens down 65% on average (22–87% range); thinking tokens unchanged”, not “65% off the bill.”

Verdict

On 2026-06-20, JuliusBrussee/caveman sat at 74,940 stars, 4,230 forks, 299 open issues, 201 commits, 15 releases (latest v1.9.0 on 2026-06-12), MIT-licensed, with a project-published benchmark of 10 real Claude API prompts showing 65% average output-token reduction (range 22–87%, raw data and reproduction script in the repo) and a caveman-compress sub-skill that cut 46% of tokens from real memory files (repository metadata, 2026-06-20; README, 2026-06-20; releases, 2026-06-20; benchmarks/, 2026-06-20; evals/, 2026-06-20). The under-reported story is the five-tool ecosystem: caveman-code (terminal coding agent, published 25-task MicroBench 2026-05-18 showing 1.93× fewer tokens than Codex CLI, 14/25 vs 15/25 pass rate), cavemem (cross-agent memory), cavekit (spec-driven build loop), and cavegemma (Gemma 4 31B LoRA fine-tune, 193-pair holdout, 81.5% eval accuracy, 96–100% code-fence exactness). The load-bearing caveat is in the README’s own Important box: “Caveman only affects output tokens — thinking/reasoning tokens untouched.” The 65% is honest for output billing and not for thinking-token billing on reasoning models. For AI Newsroom’s primary reader, the right takeaway is the writing-style layer — agent talks like a caveman, agent pays like a caveman — and the ecosystem framing, not the headline number. A real, daily-active, 75k-star skill. Pin to v1.9.0, run /caveman-stats, and judge by the receipt.