ponytail: MIT YAGNI Skill Cuts AI Agent Code by 80–94%

ponytailai-agentsyagniclaude-codecodexopencode+10
ponytail open-source repository social preview card
Image: GitHub / DietrichGebert/ponytail repository (MIT)

On June 12, 2026, an open-source developer named Dietrich Gebert shipped ponytail — a small, MIT-licensed ruleset and plugin that makes AI coding agents stop and ask, “does this need to exist?” before they type a line of code. Four days later, the project sits at v4.6.0 with 17,921 stars, 761 forks, and 8 releases in the space of a long weekend (GitHub repository, 2026-06-16; Releases, 2026-06-16). The pitch is one sentence — “He says nothing. He writes one line. It works.” — and the headline numbers in the README come from a reproducible benchmark that anyone can re-run with npx promptfoo eval -c benchmarks/promptfooconfig.yaml (ponytail README, 2026-06-16).

Ponytail is one of the first tools to take aim at the single most common complaint about AI-generated code: bloat. The benchmark measures 80–94% less code, 47–77% lower cost, and 3–6× faster runs than an unconfigured agent, on a five-task everyday suite (email validator, debounce, CSV sum, countdown timer, rate limiter) across Haiku, Sonnet, and Opus, with the median of 10 runs per cell (ponytail benchmarks/README, 2026-06-16). It works as a plugin in Claude Code, Codex, Gemini CLI, OpenCode, GitHub Copilot CLI, and Pi, and as a drop-in rules file in Cursor, Windsurf, Cline, GitHub Copilot (editor), Aider, and Kiro — i.e., essentially every agent harness a developer is likely to use today.

What it does

The core idea is a six-rung ladder the agent stops at before writing code:

1. Does this need to exist?   → no: skip it (YAGNI)
2. Stdlib does it?            → use it
3. Native platform feature?   → use it
4. Installed dependency?      → use it
5. One line?                  → one line
6. Only then: the minimum that works

(ponytail README, 2026-06-16)

The default mode is full, with lite, ultra, and off available for more or less aggressive trimming. The README is explicit about what is not on the chopping block: “Lazy, not negligent: trust-boundary validation, data-loss handling, security, and accessibility are never on the chopping block” (ponytail README, 2026-06-16).

Every shortcut the agent takes is marked in the code with a ponytail: comment naming the upgrade path — e.g., <!-- ponytail: browser has one --> next to a raw <input type="date"> (ponytail README, 2026-06-16). The /ponytail-debt command harvests those tags into a ledger so “later” doesn’t become “never”. That’s the safety property: minimal-by-default, reversible-by-tag.

Why it matters

The “AI code bloat” problem is now the dominant complaint from developers shipping AI-generated code. Agents default to over-engineering, wrap stdlib in custom code, install packages they don’t need, and ship 200 lines where 1 would do. Ponytail is the first portable, license-clean ruleset we’ve seen that addresses that failure mode at the prompt-engineering layer, not at the model layer — i.e., without retraining or fine-tuning.

The reach is unusual. The README’s install section is a tour of the current agent ecosystem:

(ponytail README, 2026-06-16)

The AGENTS.md always-on path is the long tail: any harness that reads the project’s AGENTS.md (a growing convention in the open-source agent space) inherits the ruleset with no install step. VS Code with the Codex extension reads AGENTS.md from the repo root by default; ~/.codex/AGENTS.md makes it global (ponytail README, 2026-06-16).

The numbers, with scope

The headline figures — 80–94% less code, 47–77% cheaper, 3–6× faster — apply to a five-task developer-grade benchmark, not to “all code ever written by an agent with ponytail installed.” That scope is load-bearing. The suite is small (email validator, debounce, CSV sum, countdown timer, rate limiter), and the README is candid: “These are everyday tasks. For production-grade specs, where an unconstrained agent bloats much harder, see the writeups in results/ (ponytail benchmarks/README, 2026-06-16).

The benchmarks/results/ directory is not a stub. As of 2026-06-16, it contains three real write-ups:

  1. 2026-06-12-v4-hardening-vs-caveman.md — a six-task production-grade A–F benchmark (log CLI, file sync, dispatcher, validation, auth, ledger) where ponytail v4 lands at 490 non-blank lines / 6 files vs. caveman’s 1,440 lines / 19 files (34% of caveman’s size) and passes 8/8 security probes and 6/6 concurrency probes with no regression from the added rules. The cost-to-extend comparison on surprise feature requests is 41 lines vs. caveman’s 156 on one task and 55 vs. 257 on the other — i.e., 74–79% cheaper to extend after the initial build (v4 hardening writeup, 2026-06-12).
  2. 2026-06-12-caveman-vs-ponytail.md — the earlier v1 → v2 → v3 evolution. v1 deliberately wrote minimal code but burned the savings on long “skipped on purpose” essays; v2 added an output cap (code + ≤3 short lines); v3 compressed SKILL.md from 115 to 95 lines. End state: ponytail writes 47 lines total, caveman writes 117, and the no-skill baseline writes 293 on the same five tasks (caveman vs ponytail writeup, 2026-06-12).
  3. 2026-06-15-llama3.2-local.md — a negative result that the README doesn’t hide: on llama3.2:latest (3.2B Q4_K_M) via Ollama, “the LOC effect is inside the noise floor.” Per-run totals swing from 17% below baseline to 50% above it at temperature 0.7; the n=5 median comes out +26%, a separate n=3 median comes out −17%. The signal is lost in run-to-run variance on small local models; the ruleset “transfers poorly to small local models where the multi-step decision ladder isn’t reliably followed” (llama3.2 local writeup, 2026-06-15).

The control arm is caveman, a separate MIT-licensed prose-compression skill (it leaves code “normal” but trims narrative). Caveman is a real, comparable baseline, not a strawman.

What to watch

Risks and caveats

  1. The benchmark is small and developer-grade. The headline numbers are measured on five everyday tasks. The benchmarks/results/ write-ups are stronger evidence, but the v4 hardening suite is six tasks and the llama3.2 result is one model. No claim that “ponytail makes your agent 80–94% leaner on production codebases” is supported by the public data as of 2026-06-16.
  2. The numbers are vendor self-published. The benchmark is in the project’s own repo, on its own tasks, against its own choice of baseline. The reproducer is open (npx promptfoo eval -c benchmarks/promptfooconfig.yaml), but a third-party rerun on independent tasks has not appeared in the public sources cited.
  3. The “lazy senior dev” framing is a metaphor, not a quality claim. The benchmark measures code volume, cost, and latency — not code quality, maintainability, or correctness on non-toy work. The correctness.js gate inside the benchmark checks that generated code runs, but it’s a smoke test, not a property check.
  4. The ruleset assumes the developer understands trust boundaries. “Trust-boundary validation, data-loss handling, security, and accessibility are never on the chopping block” — that is a discipline the agent is told to follow, not a guarantee the user can offload onto it. Security- or accessibility-critical work should still get a human review pass.
  5. Plugin vs. instruction-only paths behave differently. Skill-capable hosts (Claude Code, Codex, OpenCode, Gemini, Pi) get the /ponytail commands and the four mode levels. Cursor, Windsurf, Cline, Copilot, Kiro, and Antigravity load the always-on ruleset without the commands — i.e., no mode switching, no /ponytail-review, no /ponytail-debt. The README is explicit: “The instruction-only adapters (Cursor, Windsurf, Cline, Copilot, Kiro, Antigravity) load the always-on ruleset without the commands” (ponytail README, 2026-06-16).
  6. The Claude Code / Codex plugins need Node.js on PATH. “The Claude Code and Codex plugins run two tiny Node.js lifecycle hooks, so node needs to be on your PATH” — Nix and nvm users need the non-interactive shell’s PATH. “If it isn’t, the skills still work, the always-on activation just stays quiet instead of erroring on every prompt” (ponytail README, 2026-06-16). This is a footgun, not a bug.
  7. The 17,896-star figure is fast-moving. Re-verify at the time you read this. Stars in this band can move by the thousand per day for a few weeks after a viral launch.
  8. 0 security advisories as of 2026-06-16 — the repo’s security and quality tab shows none. That’s the only safe claim to make. The ruleset is plain text and the plugin is small Node.js glue; supply-chain risk is bounded.
  9. ponytail: comments are a discipline, not a magic property. The shortcut tags are the team’s job to harvest and act on. Without /ponytail-debt (or an equivalent habit), deferred shortcuts become permanent.
  10. No CLA / governance signal on the project. The project is a single-author personal repo from a new GitHub account; there is no organizational backing, no CoC, and no contribution guide visible in the README. Treat it as a high-velocity single-maintainer project, not a community of record.

Practical advice for builders

If you use Claude Code or Codex (the two hosts with the best path):

/plugin marketplace add DietrichGebert/ponytail
/plugin install ponytail@ponytail

Then trust the two lifecycle hooks in /hooks and start a new thread. Set PONYTAIL_DEFAULT_MODE=full (the default) for everyday work, ultra for cleanup passes on a stale codebase, and off for security-sensitive or accessibility-critical code (ponytail README, 2026-06-16).

If you use OpenCode (this site runs on it), drop "plugin": ["./.opencode/plugins/ponytail.mjs"] into opencode.json in your project, or copy the AGENTS.md to your project root for a zero-install always-on path.

If you use Cursor, Windsurf, Cline, or Aider, copy the matching rules file from the repo (/.cursor/rules/, /.windsurf/rules/, /.clinerules/, or /AGENTS.md) into your project. The rules will load as always-on context; you will not get the mode-switch commands.

If you want to verify the gain is real for your workload, the README ships the reproducer:

npx promptfoo eval -c benchmarks/promptfooconfig.yaml --repeat 10

Replace the five benchmark tasks with five samples from your own codebase (one debounce, one validator, one rate-limiter, etc.) and rerun on main before installing ponytail and after. The README’s local-model writeup adds the caveat: “raise --repeat (or lower the sampling temperature) before reading anything into the totals” — n=1 is noise on small models (llama3.2 local writeup, 2026-06-15).

When to switch to off: security-critical authentication, cryptography, and authorization paths; accessibility-critical UI work; any code where a missing upgrade path is a CVE in waiting. The README’s exclusion list is short and worth keeping in your team’s coding guide.

For the team’s engineering manager: the /ponytail-debt ledger is the cheapest habit to adopt. Make it a CI step: if a ponytail: comment is older than 90 days without a follow-up, raise an issue. That converts a prompt-engineering tool into a maintenance discipline.

Verdict

A real, portable, MIT-licensed, evidence-backed answer to the most common complaint about AI coding agents. The headline numbers are scoped to a five-task developer-grade benchmark and the README tells you exactly how to re-run them; the benchmarks/results/ write-ups are stronger evidence than the headline, and the llama3.2 negative result is a model of intellectual honesty. Install it for your harness in under a minute, run the reproducer against your own five tasks, and keep the /ponytail-debt habit if the team is going to live with the ruleset for more than a sprint. Skip it only if the load-bearing work in your repo is on the explicit never-trim list — security, data loss, accessibility.