oh-my-pi: a fork that turned a coding agent into a harness-engineering platform

Screenshot of the oh-my-pi GitHub README showing the 'Every tool, benchmaxed' table with four model rows (Grok Code Fast 1 6.7% to 68.3%, Gemini 3 Flash +5 pp, Grok 4 Fast minus 61 percent tokens, MiniMax 2.1x), the 'The Pi you love, with batteries included' heading, the '01 Code execution w/ tool-calling' section intro, and the fork provenance line 'Originally built on Mario Zechner's wonderful Pi, omp adds everything you're missing.'
Source: github.com/can1357/oh-my-pi/blob/main/README.md · Captured 2026-06-26 via Playwright Chromium · License: MIT (project) / screenshot used editorially for the oh-my-pi article

On 2026-06-26, the GitHub repository can1357/oh-my-pi, a fork of Mario Zechner’s pi-mono re-engineered into a full-featured terminal AI coding agent called omp, sat at 14,677 stars, 1,287 forks, 369 open issues, and 10,671 commits; it is MIT-licensed, was created on 2025-12-31, and main was last pushed at 2026-06-26 05:39 UTC (repository metadata, 2026-06-26; README, 2026-06-26). The headline number in the README is the harness claim, not the model claim: “40+ providers · 32 built-in tools · 14 lsp ops · 28 dap ops · ~55k lines of Rust core” (README, 2026-06-26). In the roughly 18 hours before publication, three releases shipped: v16.1.19 on 2026-06-25 11:30 UTC, v16.1.20 on 2026-06-25 21:01 UTC, and v16.1.21 on 2026-06-26 05:55 UTC. That is a release about every nine hours, with macOS / Linux / Windows binaries on every tag (Releases API, 2026-06-26). That cadence, more than any single feature, is the signal worth understanding.

What it is

A terminal coding agent, not a wrapper. omp ships as a single Rust binary on macOS, Linux, and Windows, with no WSL bridge on Windows (README — Install, 2026-06-26). The install is one line — curl -fsSL https://omp.sh/install | sh on Unix, irm https://omp.sh/install.ps1 | iex on Windows, brew install can1357/tap/omp on Homebrew, or bun install -g @oh-my-pi/pi-coding-agent for the npm track — and the binary reaches ~118–162 MB per platform (README, 2026-06-26; v16.1.21 release assets, 2026-06-26). Bun ≥ 1.3.14 is the only runtime requirement. The same engine drives four entry points: the interactive TUI (omp), one-shot prompt (omp -p), the Node SDK (@oh-my-pi/pi-coding-agent), and stdio RPC / ACP for editor embedding (README, 2026-06-26).

The fork relationship. oh-my-pi is a fork of Mario Zechner’s Pi — the same project previously hosted at badlogic/pi-mono (the README still links the old URL; it 301-redirects to the new home) (badlogic/pi-mono redirect verified 2026-06-26). The LICENSE file carries a dual copyright line: “© 2025 Mario Zechner / © 2025-2026 Can Bölük” (LICENSE, 2026-06-26). The README is direct about what is added: “Originally built on Mario Zechner’s wonderful Pi, omp adds everything you’re missing.” (README, 2026-06-26). Upstream Pi is itself substantial — 65.7k stars, MIT, 240 releases, v0.80.2 latest on 2026-06-23 (earendil-works/pi, 2026-06-26). What omp does is not replace Pi; it is the harness that the original author and the community have spent ~1,300 commits building on top of it. Maintainer Can Bölük frames the project’s own thesis on his blog: “The model is the moat. The harness is the bridge. Burning bridges just means fewer people bother to cross.” (blog.can.ac — The Harness Problem, 2026-02-12).

The four numbers, in context. 40+ providers is the menu of LLM backends — Anthropic, OpenAI, Google Gemini, xAI Grok, Mistral, Groq, Cerebras, Fireworks, Together, Hugging Face, NVIDIA NIM, OpenRouter, Synthetic, Vercel AI Gateway, Cloudflare AI Gateway, Perplexity, Ollama, Ollama Cloud, LM Studio, llama.cpp, vLLM, LiteLLM, plus coding-plan routes through Cursor, GitHub Copilot, GitLab Duo, Kimi Code, Moonshot, the MiniMax Coding Plan and the MiniMax Coding Plan CN, Alibaba Coding Plan, Qwen Portal, Z.AI / GLM Coding Plan, Xiaomi MiMo, Qianfan, NanoGPT, Venice, Kilo, ZenMux, OpenCode Go, and OpenCode Zen (README — Forty-plus providers, 2026-06-26). 32 built-in tools is the agent surface — read, write, edit, ast_edit, ast_grep, search, find, bash, eval, ssh, lsp, debug, task, irc, todo, job, ask, browser, web_search, github, generate_image, inspect_image, tts, checkpoint, rewind, retain, recall, reflect, resolve, search_tool_bm25, plus the omp completions family — and 14 LSP operations + 28 DAP operations mean the same tool name (lsp, debug) reaches every operation a real IDE exposes (README — Tools, 2026-06-26). ~55k lines of Rust is the size of the N-API addon and the four embedded crates that replace the usual fork-exec plumbing (README — Rust core breakdown, 2026-06-26).

Why it matters

The harness is the boundary between “the model knows what to change” and “the change is on disk”. Most agent projects leave that boundary to a thin wrapper around read, str_replace, and a bash tool. omp treats it as a product surface. Four mechanisms carry the load (items 1 to 4 below); items 5 to 7 cover the native core, config import, and the release cadence.

1 · Hashline — edit by content hash. This is the load-bearing invention. The model never retypes the lines it wants to change; the harness tags every read line with a 2-3 hex character content hash, and edits reference those tags. A file hello.ts is read back as:

1:a3|function hello() {
2:f1|  return "world";
3:0e|}

A patch then looks like:

[hello.ts#tag]
SWAP 1.=1:
+const greeting = "hello";

The [hello.ts#tag] header is a full-file content hash recorded by a SnapshotStore; the patcher resolves it, verifies that the live file still matches, and rejects the patch before it can corrupt anything if the file has changed (hashline README, 2026-06-26). The README’s “Every tool, benchmaxed” table publishes results from the project’s own 540-task, 16-model benchmark (3 runs per task, 180 tasks per run, a fresh agent session each time, and a four-tool surface that includes read, edit, and write):

Model            | Metric        | What
Grok Code Fast 1 | 6.7% → 68.3%  | Tenfold lift the moment the edit format stops eating the model alive.
Gemini 3 Flash   | +5 pp         | Over str_replace; beats Google’s own best attempt at the format.
Grok 4 Fast      | −61% tokens   | Output collapses once the retry loop on bad diffs disappears.
MiniMax          | 2.1×          | Pass rate more than doubles. Same weights, same prompt.

(README — Every tool, 2026-06-26; harness-problem blog post, 2026-02-12). All four numbers are project-published and self-attributed. The author’s blog post names the alternative formats and the cost — “Grok 4’s patch failure rate in my benchmark was 50.7%, GLM-4.7’s was 46.2%”, total benchmark cost “~$300” — and the editorial claim of the README is “You can blame the pilot for the landing gear” (harness-problem, 2026-02-12). The article does not promote these to independent benchmarks.
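The tagging and staleness check described above can be sketched in a few lines of TypeScript. This is an illustration, not the packages/hashline implementation: the hash function (SHA-256 truncated to a few hex characters), the function names, and the LineEdit shape are all assumptions made for the example.

```typescript
import { createHash } from "node:crypto";

// Assumption: truncated SHA-256 stands in for whatever hash the real
// hashline package uses.
function shortHash(text: string, hexChars = 2): string {
  return createHash("sha256").update(text).digest("hex").slice(0, hexChars);
}

// Render a file the way a tagged read looks: "<line>:<hash>|<content>".
function tagLines(source: string): string[] {
  return source.split("\n").map((line, i) => `${i + 1}:${shortHash(line)}|${line}`);
}

// Hypothetical edit shape: replace one line, identified by number and tag,
// against a whole-file snapshot hash taken at read time.
interface LineEdit {
  fileHash: string;
  line: number;
  lineHash: string;
  replacement: string;
}

function applyEdit(current: string, edit: LineEdit): string {
  // Reject stale edits before touching anything.
  if (shortHash(current, 16) !== edit.fileHash) {
    throw new Error("file changed since it was read; re-read before editing");
  }
  const lines = current.split("\n");
  const target = lines[edit.line - 1];
  if (target === undefined || shortHash(target) !== edit.lineHash) {
    throw new Error(`line ${edit.line} does not match tag ${edit.lineHash}`);
  }
  lines[edit.line - 1] = edit.replacement;
  return lines.join("\n");
}

const source = 'function hello() {\n  return "world";\n}';
const snapshot = shortHash(source, 16);
const edited = applyEdit(source, {
  fileHash: snapshot,
  line: 2,
  lineHash: shortHash('  return "world";'),
  replacement: '  return "hello";',
});
console.log(tagLines(edited).join("\n"));
```

The point of the design is that the model only ever names a line number and a short tag; it never has to reproduce the old text exactly, and a file that changed underneath it fails loudly instead of being patched in the wrong place.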

2 · First-class subagents with schema-validated yield. The task tool fans out into isolated worktrees, each worker runs its own tool surface, and the final yield is a schema-validated object the parent reads directly: “No prose to parse, no merge conflicts between siblings, no orphaned edits” (README: 05 First-class subagents, 2026-06-26). The agent runtime underneath is @oh-my-pi/pi-agent, an Agent class that emits a typed event stream (agent_start, turn_start, message_start/update/end, tool_execution_start/update/end, turn_end, agent_end) with transformContext and convertToLlm as the two pipeline seams between AgentMessage[] and the LLM’s Message[] (pi-agent README, 2026-06-26). v16.1.16 added isolated, apply, and merge options to eval agent() across the Python, JavaScript, Ruby, and Julia runtimes, “so workflowz-driven fan-outs can request the same copy-on-write worktree isolation the task tool offers” (v16.1.16 release notes, 2026-06-23). The README’s primary subagent example shows two workers (ComponentsExports and RoutesExports) returning a typed Findings object to the parent. That contract is what makes the rest of the system feel native rather than glued (README, 2026-06-26).
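A schema-validated yield can be pictured as a runtime check that either returns a typed value or throws. The sketch below is a hand-rolled stand-in for a schema validator; the Findings fields (worker, symbols) and the parseFindings helper are hypothetical, since the README's actual schema is not reproduced in this article.

```typescript
// Hypothetical shape of what a worker yields back to the parent.
interface Findings {
  worker: string;
  symbols: string[];
}

// Runtime check standing in for a schema validator: the parent gets a typed
// value or an error, never free-form prose to parse.
function parseFindings(raw: unknown): Findings {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("yield must be an object");
  }
  const candidate = raw as Record<string, unknown>;
  const worker = candidate.worker;
  const symbols = candidate.symbols;
  if (typeof worker !== "string") {
    throw new Error("worker must be a string");
  }
  if (!Array.isArray(symbols) || !symbols.every((s) => typeof s === "string")) {
    throw new Error("symbols must be an array of strings");
  }
  return { worker, symbols: symbols as string[] };
}

// Two simulated worker results, named after the README's example workers.
const yields: unknown[] = [
  { worker: "ComponentsExports", symbols: ["Button", "Card"] },
  { worker: "RoutesExports", symbols: ["home", "settings"] },
];

// The parent merges validated objects directly.
const merged: string[] = [];
for (const raw of yields) {
  merged.push(...parseFindings(raw).symbols);
}
console.log(merged);
```

A malformed yield fails at the boundary between worker and parent, which is the property the README's "no prose to parse" line is pointing at.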

3 · Time-traveling stream rules. A regex match against the streaming model output aborts the stream mid-token, injects the rule as a system reminder, and retries from the same point (README — 04 Time-traveling stream rules, 2026-06-26). The README’s example is concrete: the model is reading src.rs and about to write Box::leak; the request aborts with Error: Request was aborted, an amber ⚠ Injecting rule: box-leak card injects “Don’t reach for Box::leak in production code paths”, and the agent course-corrects to Arc<str> (README, 2026-06-26). The README is direct about why the rule rides the stream rather than the system prompt: “You get course-correction without paying context tax on every turn. Injections survive compaction, so the fix sticks.” (README, 2026-06-26). For builders, this is the most portable idea in the project — it is not a model-side mechanism, it is a stream-side mechanism, and it composes with any provider.
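The abort, inject, retry loop can be sketched without any provider at all. In the TypeScript below, the Model type, the runWithRules function, and the fake model are all hypothetical; a real stream would be asynchronous and the real rule format is omp's own. The sketch only shows the control flow: match on the accumulated output, keep the text before the match, add the reminder, and continue from that point.

```typescript
interface StreamRule {
  name: string;
  pattern: RegExp; // must not use the "g" flag, which makes test() stateful
  reminder: string;
}

// Hypothetical model interface: given the text kept so far and the reminders
// injected so far, return the next chunks. A real stream would be async.
type Model = (prefix: string, reminders: string[]) => string[];

function runWithRules(model: Model, rules: StreamRule[], maxRetries = 3) {
  const reminders: string[] = [];
  const injected: string[] = [];
  let text = "";
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    let aborted = false;
    for (const chunk of model(text, reminders)) {
      const candidate = text + chunk;
      const hit = rules.find((rule) => rule.pattern.test(candidate));
      if (hit) {
        // Abort mid-stream: keep only the text before the match, record the
        // reminder, and retry from that point.
        text = candidate.slice(0, candidate.search(hit.pattern));
        reminders.push(hit.reminder);
        injected.push(hit.name);
        aborted = true;
        break;
      }
      text = candidate;
    }
    if (!aborted) return { text, injected };
  }
  throw new Error("a rule kept firing after the retry budget was spent");
}

// Fake model: reaches for Box::leak unless a reminder has been injected.
const fakeModel: Model = (prefix, reminders) =>
  reminders.length === 0
    ? ["let s = ", "Box::leak(", "name)"]
    : ["Arc::<str>::from(name);"];

const result = runWithRules(fakeModel, [
  {
    name: "box-leak",
    pattern: /Box::leak/,
    reminder: "Don't reach for Box::leak in production code paths",
  },
]);
console.log(result.text, result.injected);
```

Because the check runs on the output stream rather than in the system prompt, the reminder costs nothing on turns where the rule never fires.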

4 · The advisor model. A second model on the advisor role reads every turn the main agent takes, with its own context and its own model, and injects notes inline — “a quiet aside, a concern, or a hard blocker” (README — 06 A second model, 2026-06-26). The example in the README is the kind of miss the advisor is built to catch: the main agent scopes a catch to ENOENT instead of swallowing every error, and the advisor’s 1 note (concern) warns that “the fix no longer matches the user’s literal acceptance criterion”. The advisor is the dual of the stream rule: the stream rule catches mechanical departures from policy, the advisor catches semantic drift from intent. Together they are the project’s answer to “the model is flaky at expressing itself”.

5 · Unapologetically native, even on Windows. Most agents shell out to rg, grep, find, and bash — fork-exec on every call, broken on machines where the binary does not exist. omp links the real implementations into the process: ripgrep, glob, find are in-process, brush is the bash with sessions that survive across calls, and the same binary runs on macOS, Linux, and Windows with no WSL bridge (README — 09 Unapologetically native, 2026-06-26). The README’s per-module Rust breakdown is worth reading: shell 3,700 LoC, grep 1,900, keys 1,490, text 1,450, summary 1,040 (tree-sitter structural source summaries), ast 1,000 (ast-grep-core), fs_cache 840, highlight 470, pty 455, glob 410, workspace 385, appearance 270 (Mode 2031 + macOS CoreFoundation FFI), power 270 (macOS IOKit power-assertion), task 260 (libuv), fd 250, iso 245, prof 240, ps 195, clipboard 80, tokens 65 (O200k + Cl100k BPE), sixel 55, html 50 (README — Rust core breakdown, 2026-06-26). The cumulative ~55k is real engineering, not a packaging number.
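As a quick arithmetic check on the figures quoted above, the per-module counts can be summed and compared with the headline. The numbers are copied from the breakdown as quoted in this article; the variable names are illustrative. The modules itemised here add up to 15,120 lines, about 27.5% of the ~55k headline, so the rest of the Rust core sits in code that this list does not itemise.

```typescript
// Per-module line counts exactly as quoted from the README breakdown above.
const moduleLoc: Record<string, number> = {
  shell: 3700, grep: 1900, keys: 1490, text: 1450, summary: 1040, ast: 1000,
  fs_cache: 840, highlight: 470, pty: 455, glob: 410, workspace: 385,
  appearance: 270, power: 270, task: 260, fd: 250, iso: 245, prof: 240,
  ps: 195, clipboard: 80, tokens: 65, sixel: 55, html: 50,
};

const listedTotal = Object.keys(moduleLoc).reduce(
  (sum, name) => sum + moduleLoc[name],
  0,
);
const headline = 55000; // the README's "~55k lines of Rust core"

console.log(`listed modules: ${listedTotal} lines`);
console.log(`share of headline: ${((listedTotal / headline) * 100).toFixed(1)}%`);
```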

6 · Eight-format config import, native. “Every other agent ships an importer and expects you to convert. omp reads the eight formats already on disk in their native shape — Cursor MDC, Cline .clinerules, Codex AGENTS.md, Copilot applyTo, and the rest.” (README — 15 Inherits what your other tools already wrote, 2026-06-26). For teams with rules already on disk in another agent, the onboarding story is “drop the binary on the path,” not “rewrite your config.”

7 · The 18-hour release cadence is the operational signal. v16.1.19, v16.1.20, v16.1.21 in 18 hours is not normal for a CLI shipping five-platform native binaries (Releases, 2026-06-26). Reading the three changelogs is more useful than the headline stars: v16.1.20 alone covers a Ctrl+Z hanging fix that turned out to be a tokio SIGTSTP-listener hijack inside brush-core (#3461), a fix to the mise() shell function dying because __MISE_EXE was lost in the snapshot script (#3470), a fix to direct Anthropic Claude Sonnet/Haiku 4.5 calls crashing every call with HTTP 400 “This model does not support the effort parameter” (#3497), and a fix to the ollama-cloud three-request concurrency cap that was being silently violated (#3464). v16.1.21 added a fix to clipboard image-paste dropping image-file-only pasteboards as literal text on macOS — the diagnostic walked pbpaste(1) and the public.file-url representation down to the AppleScript bridge that now handles the Cmd+C-on-Finder case (#3506). These are the right fixes at the right depth. A project that ships them at this cadence is a project that is using its own binary.
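The cadence is easy to verify from the three timestamps quoted at the top of the article. The sketch below uses only those timestamps; the variable names are illustrative. It gives gaps of about 9.52 and 8.90 hours between consecutive tags and 18.42 hours from the first tag to the last.

```typescript
// Release timestamps as quoted from the Releases API (UTC).
const releases = [
  { tag: "v16.1.19", at: "2026-06-25T11:30:00Z" },
  { tag: "v16.1.20", at: "2026-06-25T21:01:00Z" },
  { tag: "v16.1.21", at: "2026-06-26T05:55:00Z" },
];

const HOUR_MS = 60 * 60 * 1000;

// Hours between each release and the one before it.
const gapsHours = releases
  .slice(1)
  .map((release, i) => (Date.parse(release.at) - Date.parse(releases[i].at)) / HOUR_MS);

// Hours from the first tag to the last.
const spanHours =
  (Date.parse(releases[releases.length - 1].at) - Date.parse(releases[0].at)) / HOUR_MS;

console.log(gapsHours.map((hours) => hours.toFixed(2)));
console.log(spanHours.toFixed(2));
```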

Practical implications

For the engineer evaluating harnesses, the right question is not “which agent is best” — it is “which agent’s tool boundary is best.” omp’s answer is: tag reads with content hashes, route writes through workspace/willRenameFiles, fan out subagents with typed yields, attach an advisor to every turn, and patch the stream mid-token. None of those moves are model-side; they are harness-side, and the model’s apparent capability scales with them. If you are benchmarking an agent, run the same prompt against omp with three different edit formats — str_replace, apply_patch, and Hashline — and treat the spread as the harness contribution. The README’s published numbers are a guide to the order of magnitude, not a contract.
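"Treat the spread as the harness contribution" comes down to a small calculation per model: best pass rate minus worst across edit formats, with weights and prompt held fixed. The sketch below applies it to the one pair of pass rates the README table publishes for Grok Code Fast 1. The harnessSpread helper and the "baseline" label are assumptions; the table as quoted here does not name the starting format.

```typescript
// Spread between the best and worst pass rate (in percentage points) across
// edit formats for one model, plus the best/worst ratio.
function harnessSpread(passRates: Record<string, number>) {
  const values = Object.keys(passRates).map((format) => passRates[format]);
  if (values.length < 2) {
    throw new Error("need at least two formats to compare");
  }
  const best = Math.max(...values);
  const worst = Math.min(...values);
  return { best, worst, points: best - worst, ratio: best / worst };
}

// Published README figures for Grok Code Fast 1 (percent of tasks passed).
const grokCodeFast1 = harnessSpread({ baseline: 6.7, hashline: 68.3 });

console.log(grokCodeFast1.points.toFixed(1)); // percentage-point lift
console.log(grokCodeFast1.ratio.toFixed(1));  // multiplicative lift
```

For this row the spread works out to 61.6 percentage points, or about 10.2 times the baseline, which is where the "tenfold" description comes from. Running the same calculation on your own model and task set is the comparison that matters.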

For a team replacing Claude Code or Codex CLI today, the migration cost is low. omp reads Cursor MDC, Cline .clinerules, Codex AGENTS.md, and Copilot applyTo in their native shape, so the ruleset that “your team wrote last quarter” still works tonight (README: 15 Inherits, 2026-06-26). The provider breadth means the same binary can run the same session against Anthropic, OpenAI, xAI, Google, Mistral, Groq, Ollama, or a local llama.cpp server, with retry.fallbackChains per role and round-robin credentials for users who have hit a single-key quota (README: Forty-plus providers, 2026-06-26). For self-hosting, the binary is a single platform-tagged N-API addon, and the bash tool embeds brush-shell (a vendored fork of brush), so there is no dependency on a system bash. The relay behind /collab is the only third-party service the binary depends on at runtime.
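The combination of a fallback chain and round-robin credentials is a small pattern worth seeing in isolation. The sketch below is not omp's configuration format or code: the RoundRobin class, callWithFallback, the model names, and the keys are all hypothetical, and the call is synchronous only to keep the example short.

```typescript
// Hypothetical call signature: one attempt against one model with one key.
type Call<T> = (model: string, apiKey: string) => T;

// Hands out credentials in rotation so no single key absorbs every request.
class RoundRobin {
  private keys: string[];
  private next = 0;

  constructor(keys: string[]) {
    if (keys.length === 0) throw new Error("need at least one key");
    this.keys = keys;
  }

  take(): string {
    const key = this.keys[this.next];
    this.next = (this.next + 1) % this.keys.length;
    return key;
  }
}

// Try each model in the chain in order; the first success wins.
function callWithFallback<T>(chain: string[], keys: RoundRobin, call: Call<T>): T {
  let lastError: unknown = new Error("empty fallback chain");
  for (const model of chain) {
    try {
      return call(model, keys.take());
    } catch (err) {
      lastError = err; // fall through to the next model in the chain
    }
  }
  throw lastError;
}

// Simulated provider: the primary model is over quota, the backup answers.
const answer = callWithFallback(
  ["primary-model", "backup-model"],
  new RoundRobin(["key-a", "key-b"]),
  (model, apiKey) => {
    if (model === "primary-model") throw new Error("quota exceeded");
    return `${model} via ${apiKey}`;
  },
);
console.log(answer);
```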

For a researcher studying agent reliability, the load-bearing mechanisms are reproducible. Hashline is in packages/hashline and exported as a standalone patch language with a SnapshotStore and a Patcher (hashline README, 2026-06-26). The agent runtime is in packages/agent with typed events and explicit transformContext / convertToLlm seams (pi-agent README, 2026-06-26). The author’s blog post is the only public 540-task, 16-model × 3-edit-format matrix the article found, and the code is on disk under the repository (harness-problem, 2026-02-12; repository, 2026-06-26). If you are testing harness interventions, this is the most reproducible project in the public corpus as of 2026-06-26.

For a builder wiring agents into a larger product, the SDK and ACP entry points are real. The @oh-my-pi/pi-coding-agent package exposes ModelRegistry, SessionManager, createAgentSession, and discoverAuthStorage, and the session emits typed events over a Node-friendly EventEmitter interface (README: SDK, 2026-06-26). For non-Node hosts, omp --mode rpc exposes NDJSON commands and response/event frames; omp acp speaks the Agent Client Protocol over JSON-RPC, with bash mapped to terminal/create + terminal/output, read mapped to fs/read_text_file, and edit / bash writes gated by session/request_permission (README: ACP, 2026-06-26). Inside Zed, this means the same agent you drive from the terminal drives the editor’s buffer and save path; no bridge, no plugin, no second brain to keep in sync (README: 14 ACP, 2026-06-26).
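A host that reads NDJSON frames from a child process has to cope with chunks that end mid-line. The decoder below is a generic sketch of that framing step; the NdjsonDecoder class and the frame fields (type, name, id) are hypothetical and are not taken from omp's RPC schema.

```typescript
// Accumulates raw output chunks and emits one parsed frame per complete line.
class NdjsonDecoder {
  private pending = "";

  push(chunk: string): unknown[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    // The last element is either "" (the chunk ended on a newline) or a
    // partial line that must wait for the next chunk.
    const rest = lines.pop();
    this.pending = rest === undefined ? "" : rest;
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as unknown);
  }
}

// Two chunks that split the second frame in the middle of a key.
const decoder = new NdjsonDecoder();
const frames: unknown[] = [];
frames.push(...decoder.push('{"type":"event","name":"turn_start"}\n{"type":"resp'));
frames.push(...decoder.push('onse","id":1}\n'));
console.log(frames);
```

Keeping the partial line in a buffer rather than parsing per chunk is what makes the reader safe against arbitrary pipe boundaries.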

Risks and caveats

The same picture that makes omp worth covering is what makes the caveats load-bearing.

  1. The “61% fewer tokens” / “10× lift” numbers are project-published and self-attributed. The 540-task edit benchmark (3 runs × 180 tasks × 16 models) is in the maintainer’s own repository and described in his own blog post. The README presents the table without a third-party reproduction; the article does not promote the numbers to independent benchmarks. Treat the spread as the order-of-magnitude the harness contribution can deliver, not as a bill-line guarantee. The 6.7% → 68.3% on Grok Code Fast 1 is the right example to quote because it is the most extreme — and the right next question is “what does it look like on the model you are actually paying for?”
  2. The my.omp.sh relay is a third-party dependency. The /collab flow routes live-session frames through a relay at my.omp.sh. The README is direct: “Frames are sealed client-side; the relay never sees your keys.” That protects content, not availability. A team running omp in production should know that collab is offline when the relay is offline, and that the relay’s own status is not part of any public SLO. If the relay becomes a paid tier or a paid add-on, that is a project-direction signal.
  3. Supply-chain surface is large. 40+ provider SDKs, 14 web-search backends, MCP client, ACP, Discord, four platform-tagged native binaries — every additional dependency is a blast radius. The repo’s patches/ directory is a quiet signal here: when a transitive dependency misbehaves, the project patches it. That is a working supply-chain posture, not a finished one, and a single compromised provider SDK can affect every install that uses it.
  4. 369 open issues on 10,671 commits is a velocity signal, not a stability signal. The recent release churn (six tags in 72 hours, including v16.1.14, v16.1.15, v16.1.16, v16.1.19, v16.1.20, and v16.1.21) suggests active break-then-fix loops, not silent rot. The risk is that incoming issues outpace the fix rate, and that a 14k-star user base outpaces the maintainers.
  5. MIT is permissive; the maintenance commitment is not guaranteed. The dual copyright is honest (“© 2025 Mario Zechner / © 2025-2026 Can Bölük”, LICENSE, 2026-06-26), and the project is a fork of a separate, actively-released upstream (earendil-works/pi, v0.80.2 on 2026-06-23). A team that adopts omp should pin a release tag in production (the v16.1.21 binaries are SHA-256-verified on the release page) and watch the upstream Pi releases for breaking changes that the fork has not yet incorporated.
  6. The contribution model is a vouch system, not a community of record. CONTRIBUTING.md is direct: “Pull requests require a vouch. A PR whose author is not vouched (or is denounced) is closed automatically.” The format follows mitchellh/vouch, with .github/VOUCHED.td as the source of truth and !vouch / !denounce / !unvouch commands runnable by collaborators. That is a deliberate gate, not a bug, and it is part of why a project at this size moves this fast. The trade is that the project is one maintainer’s call, and community-governance questions that a code of conduct would normally settle have no home: there is no CODE_OF_CONDUCT.md in the repo root as of 2026-06-26. The gap is worth flagging; it is not a reason to dismiss the project.
  7. The author’s own account of vendor friction is part of the story. The 2026-02-12 blog post includes two short side paragraphs: one on Anthropic blocking OpenCode (the post grants that Anthropic’s position, ‘OpenCode reverse-engineered a private API’, is fair on its face) and one on Google disabling the author’s own Gemini account “for running a benchmark”, which the post describes as “the same one that showed Gemini 3 Flash hitting 78.3% with a novel technique that beats their best attempt at it by 5.0 pp”. The article reports both paragraphs as the author’s account, and the vendors’ responses as the author’s framing, not as adjudicated fact. The factual content (the 5.0 pp lead over Google’s best attempt and the free-R&D framing) is project-attributed.
  8. The “Windows, no WSL” promise is real but with platform-specific sharp edges. The v16.1.15 changelog shows the fix: “Fixed MCP stdio servers failing on Windows when the launcher’s PATH walk can’t pin down a bare npx/yarn/pnpm-style shim” (#3250). The fix routes any unresolvable bare command through cmd.exe /d /s /c. The work is real, the fix is real, and the long tail of “Windows is not Linux” edge cases — UNC mounts that reject fs.access, locked-down shells, restricted parent processes that strip Bun.env.PATH — is also real. A team standardising on Windows should expect to be on the bleeding edge for another quarter.
  9. omp is one maintainer’s call. The dual copyright names one person as the active maintainer. The Discord is the public channel. There is no foundation, no advisory board, no CoC, no published roadmap. That is a fair trade for the velocity — but a team adopting omp should know who they are betting on.

What to watch

  1. An independent third-party reproduction of the 16-model × 3-edit-format matrix. The harness-problem blog post walks the methodology (540 tasks, 16 models, 3 edit formats, 3 runs per cell) and the maintainer states the benchmark cost (~$300). The benchmark code is in the repository under packages/coding-agent and the article does not have a third-party re-run. The next-cycle story is the first external reproduction — particularly on a model not in the original sixteen.
  2. The pi-mono upstream fork-merge cadence. Upstream Pi is at v0.80.2 on 2026-06-23 with 240 releases (earendil-works/pi, 2026-06-26). omp is a fork, not a vendor branch — a Pi release that lands a breaking change to the TUI or the agent runtime will land in omp as a merge, not a no-op. Watch the compare links in the oh-my-pi releases: every v16.1.x changelog is the merge surface in microcosm.
  3. The advisor model in real sessions. The README is direct that the advisor is a separate context, a separate model, and a per-turn inline note. The performance story is qualitative (the README’s “concern” example) — there is no published benchmark on advisor intervention rate, false-positive rate, or pass-rate delta. The next-cycle story is the first published advisor evaluation.
  4. /collab and the relay. The relay is a third-party service. If the relay becomes a paid tier, a paid add-on, or a closed beta, that is a direction signal. The README does not list an SLO for the relay, and the article does not invent one.
  5. The VOUCHED.td governance signal. The vouch system is the project’s contribution gate, and the denouncement list is intentionally public “so other projects can reuse our prior knowledge of bad actors” (CONTRIBUTING.md, 2026-06-26). The next-cycle signal is whether .github/VOUCHED.td becomes a multi-maintainer artefact or stays single-editor.
  6. The 14k→15k star crossing. The repo hit 14,677 stars on 2026-06-26 with 1,287 forks and 369 open issues. Stars in this band move by the thousand per week. Re-verify the headline number at the time of reading.
  7. The Rust core’s CVE surface. The README’s per-module breakdown names pi-natives, pi-shell, pi-ast, pi-iso, brush-core-vendored, and brush-builtins-vendored as the Rust crates (README, 2026-06-26). The vendored brush-* is a deliberate supply-chain choice — the project owns the fork. A CVE in brush upstream is a CVE the project will need to merge, and the project’s security policy is the next thing to verify if the project gains enterprise users.
  8. The v17 signals. The current version line is v16.1.x and the changelog scope has been sub-minor since mid-June 2026. A v17 cut would be a direction signal — the kind of change that is worth a follow-up article on its own.

Verdict

On 2026-06-26, can1357/oh-my-pi, the fork of Mario Zechner’s Pi now shipping as omp, sat at 14,677 stars, 1,287 forks, 369 open issues, and 10,671 commits. It is MIT-licensed, was created on 2025-12-31, and ships a Rust core of ~55k lines, 40+ LLM providers, 32 built-in tools, 14 LSP operations, and 28 DAP operations. Its project-published 16-model edit benchmark puts the harness contribution on the table as a tenfold lift on Grok Code Fast 1, a +5 pp lead over Google’s own best str_replace attempt for Gemini 3 Flash, a 61% output-token cut for Grok 4 Fast, and a 2.1× pass rate on MiniMax, with the same weights and the same prompt (repository metadata, 2026-06-26; README, 2026-06-26; harness-problem blog, 2026-02-12; Releases, 2026-06-26). The roughly 18-hour release run across 2026-06-25 and 2026-06-26 (three tags, with real bug fixes at the right depth) is the second load-bearing signal after the 540-task benchmark: this is a project using its own binary, not shipping a demo. The contribution model is a vouch system, the my.omp.sh relay is the only third-party service the binary depends on at runtime, the 369 open issues are a velocity signal rather than a stability signal, and the maintainer’s name is on the LICENSE line. For AI Newsroom’s primary reader, the takeaway is the harness-engineering claim (the model is the moat, the harness is the bridge) and a concrete, working, MIT-licensed demonstration of what a serious agent boundary looks like in 2026: a coding agent that reached nearly 15k stars in under six months, treats the harness as a product, shipped three releases in roughly 18 hours, and publishes its benchmark. Pin a release tag, read packages/hashline, and watch the relay.