Mistral opens Leanstral 1.5: 6B-active Apache-2.0 Lean 4 prover

July 4, 2026·AI Newsroom·4 min read·9 sources


Leanstral 1.5 benchmark summary showing PutnamBench 587/672, FATE-H 87%, and FATE-X 34% compared against Seed-Prover 1.5 high, Goedel-Architect, A-ProverBase, and Aleph Prover (Mistral AI blog, 2026-07-02). — Screenshot of Mistral AI blog post "Leanstral 1.5: Proof Abundance for All", captured 2026-07-04 from https://mistral.ai/news/leanstral-1-5/ via Playwright Chromium through the AI Newsroom browser helper. License: no license stated; editorial use of a single chart image from a public blog post for news commentary.

Mistral AI released Leanstral 1.5 on 2026-07-02 — an Apache-2.0, 119B/6B-active MoE for proof engineering in Lean 4. It saturates miniF2F, solves 587 of 672 PutnamBench problems, sets SOTA on FATE-H (87%) and FATE-X (34%), and lifts FLTEval pass@8 from 31.9 to 43.2 — past Opus 4.6’s 39.6 at ~1/7 the cost. Weights are on Hugging Face; a free leanstral-1-5 API is live.


What happened

Leanstral 1.5 is a 119B/6B-active MoE — small enough for one consumer node, large enough to learn deep proof structure (Mistral blog, 2026-07-02). Training runs mid-training → SFT → RL with CISPO in two custom environments that return feedback from a real Lean compiler.

The multiturn Lean verifier gives the model a theorem, returns the Lean compiler’s verdict, and loops until the proof compiles. The code agent environment gives it a filesystem, bash, and the Lean language server; it edits files, builds auxiliary lemmas, and survives many rounds of context compaction. Final proofs are checked by Mistral’s fork of SafeVerify.
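The propose, compile, retry loop of the multiturn verifier can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Mistral's environment: `Verdict`, `prove_with_feedback`, and the `propose` and `check` callables are hypothetical names. In a real setup `check` would run the Lean compiler and `propose` would call the model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    ok: bool       # True when the proof compiled
    message: str   # compiler output that is fed back to the model


# Each history entry is (proof attempt, compiler message).
History = List[Tuple[str, str]]


def prove_with_feedback(
    theorem: str,
    propose: Callable[[str, History], str],   # hypothetical: wraps the model
    check: Callable[[str, str], Verdict],     # hypothetical: wraps the Lean compiler
    max_turns: int = 8,
) -> Tuple[Optional[str], History]:
    """Propose a proof, read the verdict, and retry until it compiles."""
    history: History = []
    for _ in range(max_turns):
        proof = propose(theorem, history)
        verdict = check(theorem, proof)
        history.append((proof, verdict.message))
        if verdict.ok:
            return proof, history
    return None, history  # turn budget exhausted without a compiling proof
```

Passing the full history back to `propose` is the design point: each new attempt can condition on every earlier compiler error rather than only the latest one.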

Why it matters

Open weights. Apache-2.0 means anyone can run, fine-tune, or audit; Seed-Prover 1.5 and Aleph Prover are API-only.
Consumer hardware. Only 6B of the 119B parameters are active per token, which keeps inference compute low, though the full 119B weights still have to be held in memory, hence the high-memory workstation. For contrast, Mistral puts Seed-Prover 1.5 high at 10 H20-days per problem.
Working install path. Mistral Vibe → /leanstall → vibe --agent lean, plus an optional Lean LSP MCP.
Open benchmark. FLTEval is now open source, released alongside the model as a community yardstick for real-PR proof engineering.

Benchmarks and cost

All numbers from the Mistral blog post, 2026-07-02.

Benchmark	Leanstral 1.5	Best open	Best closed
miniF2F (val + test)	100% / 100%	n/a	n/a
PutnamBench (of 672)	587	A-ProverBase 365	Aleph Prover 668; Seed-Prover 1.5 high 580
FATE-H (%)	87 (SOTA)	A-ProverBase 66	Seed-Prover 1.5 high 80
FATE-X (%)	34 (SOTA)	A-ProverBase 24	Seed-Prover 1.5 high 33
FLTEval pass@1 / pass@8	28.9 / 43.2	21.9 / 31.9	Opus 4.6: 39.6 pass@8 (~7× the cost)

Cost framing (vendor estimates). Mistral reports ~$4 per PutnamBench problem for Leanstral 1.5, $300+ per problem for Seed-Prover 1.5 high, $54–68 per problem for Aleph Prover. No closed prover publishes per-problem cost independently.
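To put the per-problem figures on one scale, the short sketch below multiplies them out over the 672 PutnamBench problems. It assumes the vendor estimate applies to every attempted problem, treats "$300+" as a lower bound, and uses hypothetical helper names (`full_run_cost`, `ratio_to_baseline`); it is arithmetic on Mistral's numbers, not an independent measurement.

```python
PROBLEMS = 672  # PutnamBench size, as reported in the article

# Vendor-estimated USD per problem as (low, high), all from Mistral's post.
cost_per_problem = {
    "Leanstral 1.5": (4, 4),
    "Aleph Prover": (54, 68),
    "Seed-Prover 1.5 high": (300, 300),  # "$300+": treated as a lower bound
}


def full_run_cost(low, high, problems=PROBLEMS):
    """Cost of attempting every problem once at the quoted per-problem rate."""
    return low * problems, high * problems


def ratio_to_baseline(low, high, baseline=4):
    """How many times more expensive than the $4 Leanstral 1.5 estimate."""
    return low / baseline, high / baseline


for name, (low, high) in cost_per_problem.items():
    run_low, run_high = full_run_cost(low, high)
    rel_low, rel_high = ratio_to_baseline(low, high)
    print(f"{name}: ${run_low:,} to ${run_high:,} per full run, "
          f"{rel_low:g}x to {rel_high:g}x the Leanstral estimate")
```

On these estimates a full pass costs about $2,688 for Leanstral 1.5, $36,288 to $45,696 for Aleph Prover, and at least $201,600 for Seed-Prover 1.5 high.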

Test-time scaling. PutnamBench pass@8 climbs monotonically with per-attempt tokens: 44 at 50k → 244 at 200k → 493 at 1M → 587 at 4M. The curve has not plateaued at 4M.
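One way to read the scaling numbers is to ask how many extra problems each doubling of the token budget buys. The sketch below does that for the four reported points; `gains_per_doubling` is a hypothetical helper, and the result interpolates between sparse data points rather than describing a measured curve.

```python
import math

# (per-attempt token budget, PutnamBench problems solved), from the article
points = [(50_000, 44), (200_000, 244), (1_000_000, 493), (4_000_000, 587)]


def gains_per_doubling(points):
    """Extra problems solved per doubling of the token budget, per segment."""
    gains = []
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        doublings = math.log2(t1 / t0)
        gains.append((s1 - s0) / doublings)
    return gains


gains = gains_per_doubling(points)
for (t0, _), (t1, _), gain in zip(points, points[1:], gains):
    print(f"{t0:>9,} -> {t1:>9,} tokens: {gain:.0f} extra problems per doubling")
```

On these four points the gain is roughly 100, 107, and 47 problems per doubling, so the count is still rising at 4M tokens but the marginal return has about halved.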

PutnamBench Pass@8 vs token budget — 44 problems at 50k tokens, 587 at 4M tokens

Real-world bug discovery

Mistral ran a verification pipeline against 57 Rust repos using Aeneas (Rust → Lean), intent inference, and 4 attempts to prove + 4 to disprove each generated property. Result: 47 violated properties flagged, 11 genuine bugs, 5 previously unreported on GitHub (no CVEs at the time of writing).
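The funnel in that result is easier to judge as rates. The sketch below is plain arithmetic on the reported counts; `triage_rates` and its field names are hypothetical labels chosen for illustration.

```python
def triage_rates(flagged, genuine, unreported, prove_attempts, disprove_attempts):
    """Summarise the flagged -> genuine -> unreported funnel as simple ratios."""
    return {
        # share of flagged properties that turned out to be real bugs
        "true_bug_rate": genuine / flagged,
        # share of real bugs that had no existing GitHub report
        "unreported_share": unreported / genuine,
        # maximum model attempts spent on one generated property
        "attempts_per_property": prove_attempts + disprove_attempts,
    }


# Counts as reported for the 57-repo run
rates = triage_rates(flagged=47, genuine=11, unreported=5,
                     prove_attempts=4, disprove_attempts=4)
for name, value in rates.items():
    print(f"{name}: {value:.3g}")
```

Roughly 23% of the flagged properties were genuine bugs, and about 45% of those genuine bugs were new, so most flags still need human triage.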

AVL tree time-complexity proof. A real AVL implementation was proven O(log n) for insert and delete via structural induction on the TimeM monad — 2.7M+ tokens across 22 compactions, almost-tight bound of 48 steps per height unit plus a constant.
U64 overflow in datrs/varinteger. Zigzag decoding computes value + 1, which overflows when value is Std.U64.MAX: debug builds panic, while release builds wrap silently and return corrupt data.
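The overflow class is easy to reproduce in miniature. The sketch below models 64-bit wrapping arithmetic in Python and assumes a naive decoder that adds 1 to the raw value for odd inputs. It illustrates the failure mode only; it is not the code from datrs/varinteger, and both function names are hypothetical.

```python
U64_MAX = 2**64 - 1


def zigzag_decode_wrapping(value: int) -> int:
    """Naive decode using wrapping u64 addition (models a Rust release build)."""
    if value & 1:
        incremented = (value + 1) & U64_MAX  # wraps to 0 when value == U64_MAX
        return -(incremented >> 1)
    return value >> 1


def zigzag_decode_safe(value: int) -> int:
    """Overflow-free form: shifts first and never adds 1 to the raw value."""
    return (value >> 1) ^ -(value & 1)


# The two decoders agree on ordinary inputs...
assert all(zigzag_decode_wrapping(v) == zigzag_decode_safe(v) for v in range(1000))

# ...and diverge only at the top of the range.
print(zigzag_decode_wrapping(U64_MAX))  # wrapped result
print(zigzag_decode_safe(U64_MAX))      # intended result
```

At `U64_MAX` the wrapping decoder returns 0 where the intended value is -2^63, the kind of silent corruption that ordinary test inputs never reach.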

Practical implications for builders

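# Shell: install Mistral Vibe, upgrade it to the latest release, run first-time setup
uv tool install mistral-vibe && uv tool upgrade mistral-vibe && vibe --setup
# In a Vibe session: install the Lean agent with the slash command, then leave
/leanstall
exit
# Shell: start Vibe with the Lean agent
vibe --agent lean
# Optional: Lean LSP MCP in ~/.vibe/config.toml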

Maintain a Rust library? Point Leanstral at a real PR — the 57-repo Aeneas pipeline is the closest reproducible demo.
Research formal methods? Open weights + CISPO RL is a new baseline; SafeVerify and FLTEval are open source.
Ship Lean 4 production code? The optional Lean LSP MCP is the real unlock — Leanstral drives lean_goal and iterates on stuck goals.
Just want to try it? The free leanstral-1-5 API endpoint is the shortest path; no weights, no GPU.

Risks and caveats

Benchmark-vs-real-world gap. The 11-of-47 true-bug rate (about 23% of flagged properties) is a useful signal, not a guarantee.
Lean 4 only. No Rocq (formerly Coq), Isabelle, Agda, or HOL Light support.
Vendor cost estimates. The $4 vs $300+ vs $54–68 framing is from Mistral; closed provers do not publish per-problem cost independently.
No CVEs on the 5 unreported bugs. Do not claim CVE IDs in derivative coverage.
Verifier scope. LeanstralSafeVerify is a fork of SafeVerify and may not generalise to Lean projects with custom build setups.
PutnamBench is more crowded than it looks. Aleph Prover solves 668/672 at a higher per-problem cost, and some higher-ranked results use natural-language (NL) proof hints, which Leanstral 1.5 does not.

What to watch

Independent reproductions of FATE-H/X and FLTEval numbers.
CVE assignments and upstream patches for the 5 previously unreported bugs.
Mistral Vibe adoption — 6B-active open weights + an agent harness is a low barrier.
FLTEval adoption as a community benchmark.
Follow-on releases — Leanstral 1.6 or distilled variants have not been announced; Mistral has no published roadmap as of 2026-07-04.