Mistral opens Leanstral 1.5: 6B-active Apache-2.0 Lean 4 prover

Figure: Leanstral 1.5 benchmark summary showing PutnamBench 587/672, FATE-H 87%, and FATE-X 34% compared against Seed-Prover 1.5 high, Goedel-Architect, A-ProverBase, and Aleph Prover (Mistral AI blog, 2026-07-02).
Image credit: screenshot of the Mistral AI blog post "Leanstral 1.5: Proof Abundance for All", captured 2026-07-04 from https://mistral.ai/news/leanstral-1-5/. No license stated; a single chart image from a public blog post, used editorially for news commentary.

Mistral AI released Leanstral 1.5 on 2026-07-02: an Apache-2.0 mixture-of-experts (MoE) model with 119B total and 6B active parameters, built for proof engineering in Lean 4. It saturates miniF2F, solves 587 of 672 PutnamBench problems, sets SOTA on FATE-H (87%) and FATE-X (34%), and lifts FLTEval pass@8 from 31.9 to 43.2, passing Opus 4.6’s 39.6 at roughly 1/7 the cost. Weights are on Hugging Face, and a free leanstral-1-5 API is live.


What happened

Leanstral 1.5 is a 119B/6B-active MoE, which Mistral describes as small enough for one consumer node and large enough to learn deep proof structure (Mistral blog, 2026-07-02). Training proceeds from mid-training to SFT to RL with CISPO; the RL stage runs in two custom environments that return feedback from a real Lean compiler.

The multiturn Lean verifier gives the model a theorem, returns the Lean compiler’s verdict on each proof attempt, and loops until the proof compiles. The code agent environment gives the model a filesystem, bash, and the Lean language server; there it edits files, builds auxiliary lemmas, and keeps working across many rounds of context compaction. Final proofs are checked by Mistral’s fork of SafeVerify.
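The multiturn verifier loop described above can be sketched roughly as follows. This is an illustration only, not Mistral's implementation: `propose`, `check`, `Verdict`, and the `max_turns` budget are hypothetical names, and the toy stand-ins replace the real model and Lean compiler so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    ok: bool        # True when the attempt compiled without errors
    feedback: str   # compiler output handed back to the model


def prove_with_feedback(
    theorem: str,
    propose: Callable[[str, List[str]], str],  # hypothetical model call
    check: Callable[[str, str], Verdict],      # hypothetical Lean compiler wrapper
    max_turns: int = 8,                        # assumed budget, not from the source
) -> Optional[str]:
    """Return an attempt that `check` accepts, or None if the budget runs out."""
    history: List[str] = []
    for _ in range(max_turns):
        attempt = propose(theorem, history)
        verdict = check(theorem, attempt)
        if verdict.ok:
            return attempt
        history.append(verdict.feedback)  # the next attempt sees this error
    return None


# Toy stand-ins so the sketch runs without Lean or a model.
def toy_propose(theorem: str, history: List[str]) -> str:
    return f"attempt {len(history) + 1}"


def toy_check(theorem: str, attempt: str) -> Verdict:
    if attempt == "attempt 3":
        return Verdict(True, "")
    return Verdict(False, f"error: {attempt} did not compile")


proof = prove_with_feedback("theorem demo", toy_propose, toy_check)
print(proof)
```

The design point is that each failed attempt feeds the compiler's message into the next proposal, so correctness is judged by the compiler rather than by the model.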

Why it matters

  1. Open weights. Apache-2.0 means anyone can run, fine-tune, or audit; Seed-Prover 1.5 and Aleph Prover are API-only.
  2. Consumer hardware. Only 6B of the 119B parameters are active per token, which keeps inference within reach of one high-memory workstation; Mistral puts Seed-Prover 1.5 high at 10 H20-days per problem.
  3. A working install path. Set up Mistral Vibe, run /leanstall, then launch vibe --agent lean; a Lean LSP MCP is optional.
  4. FLTEval is now open source, released alongside the model as a community yardstick for real-PR proof engineering.

Benchmarks and cost

All numbers from the Mistral blog post, 2026-07-02.

| Benchmark | Leanstral 1.5 | Best open | Best closed |
|---|---|---|---|
| miniF2F (val / test) | 100% / 100% | n/a | n/a |
| PutnamBench (solved, of 672) | 587 | A-ProverBase 365 | Aleph Prover 668; Seed-Prover 1.5 high 580 |
| FATE-H (%) | 87 (SOTA) | A-ProverBase 66 | Seed-Prover 1.5 high 80 |
| FATE-X (%) | 34 (SOTA) | A-ProverBase 24 | Seed-Prover 1.5 high 33 |
| FLTEval pass@1 / pass@8 | 28.9 / 43.2 | 21.9 / 31.9 | Opus 4.6: 39.6 pass@8 (~7× the cost) |
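For readers unfamiliar with the pass@k figures in the FLTEval row, one common way to estimate pass@k from n sampled attempts with c successes is the combinatorial estimator below. The source does not say how Mistral computes its numbers, so this is a sketch of the standard estimator, not of Mistral's evaluation code; the sample counts in the usage lines are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts of which c succeeded."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k
    # subset of the n attempts contains no successful attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 16 attempts on one task, 2 of them correct.
print(f"pass@1 = {pass_at_k(16, 2, 1):.3f}")
print(f"pass@8 = {pass_at_k(16, 2, 8):.3f}")
```

Averaging this quantity over all tasks in a benchmark gives the headline figure; the gap between pass@1 and pass@8 shows how much a prover gains from repeated sampling.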

Cost framing (vendor estimates). Mistral reports ~$4 per PutnamBench problem for Leanstral 1.5, $300+ per problem for Seed-Prover 1.5 high, $54–68 per problem for Aleph Prover. No closed prover publishes per-problem cost independently.
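To put the vendor figures side by side, the short calculation below derives cost ratios and the cost of a full 672-problem PutnamBench pass. It assumes the per-problem estimates apply uniformly to every problem, which the source does not state, and it treats "$300+" as a lower bound.

```python
# Per-problem cost estimates in USD, as reported by Mistral (vendor figures).
LEANSTRAL = 4.0
SEED_PROVER_MIN = 300.0       # reported as "$300+", so a lower bound
ALEPH_RANGE = (54.0, 68.0)
N_PROBLEMS = 672              # PutnamBench size

# How many times more each closed prover costs per problem.
seed_ratio_min = SEED_PROVER_MIN / LEANSTRAL
aleph_ratio = tuple(cost / LEANSTRAL for cost in ALEPH_RANGE)

# Cost of one pass over the whole benchmark under the uniform-cost assumption.
full_run = {
    "Leanstral 1.5": LEANSTRAL * N_PROBLEMS,
    "Seed-Prover 1.5 high (min)": SEED_PROVER_MIN * N_PROBLEMS,
    "Aleph Prover (low)": ALEPH_RANGE[0] * N_PROBLEMS,
    "Aleph Prover (high)": ALEPH_RANGE[1] * N_PROBLEMS,
}

print(f"Seed-Prover 1.5 high: at least {seed_ratio_min:.0f}x Leanstral 1.5")
print(f"Aleph Prover: {aleph_ratio[0]:.1f}x to {aleph_ratio[1]:.1f}x Leanstral 1.5")
for name, usd in full_run.items():
    print(f"{name}: ${usd:,.0f}")
```

On these estimates a full pass costs about $2,688 for Leanstral 1.5, $36,288 to $45,696 for Aleph Prover, and at least $201,600 for Seed-Prover 1.5 high.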

Test-time scaling. PutnamBench pass@8 climbs monotonically with the per-attempt token budget: 44 problems at 50k tokens, 244 at 200k, 493 at 1M, and 587 at 4M. The curve has not plateaued at 4M.
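One way to read the scaling numbers is to ask how many extra problems each doubling of the token budget buys. The sketch below computes that from the four reported points; normalising by doublings is an analysis choice made here, not something the source does.

```python
from math import log2

# (per-attempt token budget, PutnamBench problems solved at pass@8), from the article.
points = [(50_000, 44), (200_000, 244), (1_000_000, 493), (4_000_000, 587)]
TOTAL = 672


def gains_per_doubling(pts):
    """Extra problems solved per doubling of the budget, for each adjacent pair."""
    out = []
    for (t0, s0), (t1, s1) in zip(pts, pts[1:]):
        out.append((s1 - s0) / log2(t1 / t0))
    return out


gains = gains_per_doubling(points)
for (t0, _), (t1, _), g in zip(points, points[1:], gains):
    print(f"{t0:>9,} -> {t1:>9,} tokens: {g:6.1f} extra problems per doubling")
print(f"solved at 4M tokens: {points[-1][1] / TOTAL:.1%}")
```

By this measure the gain is about 100 problems per doubling up to 1M tokens and about 47 per doubling between 1M and 4M: the count is still rising, but more slowly.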

Figure: PutnamBench pass@8 vs. per-attempt token budget (44 problems at 50k tokens, 587 at 4M tokens).

Real-world bug discovery

Mistral ran a verification pipeline against 57 Rust repos using Aeneas (Rust → Lean), intent inference, and 4 attempts to prove + 4 to disprove each generated property. Result: 47 violated properties flagged, 11 genuine bugs, 5 previously unreported on GitHub (no CVEs at the time of writing).
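The prove-or-disprove step and the resulting funnel can be sketched as below. The function names, the ordering (all prove attempts before any disprove attempt), the three labels, and the property names are assumptions made for illustration; the source gives only the attempt counts and the final tallies.

```python
from typing import Callable

Attempt = Callable[[str, int], bool]  # (property, attempt index) -> succeeded?


def classify(prop: str, try_prove: Attempt, try_disprove: Attempt,
             attempts: int = 4) -> str:
    """Label a property 'proved', 'violated', or 'unknown' within the budget."""
    if any(try_prove(prop, i) for i in range(attempts)):
        return "proved"
    if any(try_disprove(prop, i) for i in range(attempts)):
        return "violated"  # the negation was proved: a candidate bug to triage
    return "unknown"


# Toy stand-ins so the sketch runs without Aeneas, Lean, or a model.
def toy_prove(prop: str, i: int) -> bool:
    return prop == "len_nonneg" and i == 2  # succeeds on the third try


def toy_disprove(prop: str, i: int) -> bool:
    return prop == "no_overflow"  # disproved on the first try


props = ["len_nonneg", "no_overflow", "sorted_out"]
labels = {p: classify(p, toy_prove, toy_disprove) for p in props}
print(labels)

# Funnel reported in the article.
flagged, genuine, unreported = 47, 11, 5
precision = genuine / flagged       # share of flagged violations that were real bugs
novel_share = unreported / genuine  # share of real bugs not already on GitHub
print(f"genuine / flagged = {precision:.1%}, unreported / genuine = {novel_share:.1%}")
```

About 23% of flagged violations turned out to be genuine bugs, so a "violated" label is a triage lead rather than a confirmed defect.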

Practical implications for builders

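# Install Mistral Vibe, bring it up to date, then run first-time setup
uv tool install mistral-vibe && uv tool upgrade mistral-vibe && vibe --setup
# Inside the interactive Vibe session: install the Lean agent, then leave
/leanstall
exit
# Relaunch Vibe with the Lean agent
vibe --agent lean
# Optional: Lean LSP MCP in ~/.vibe/config.toml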

Risks and caveats

  1. Benchmark-vs-real-world gap. Of 47 flagged violations, 11 (about 23%) were genuine bugs, and 5 of those were previously unreported. That is a useful signal, not a guarantee.
  2. Lean 4 only. No Rocq (formerly Coq), Isabelle, Agda, or HOL Light support.
  3. Vendor cost estimates. The $4 vs $300+ vs $54–68 framing is from Mistral; closed provers do not publish per-problem cost independently.
  4. No CVEs on the 5 unreported bugs. Do not claim CVE IDs in derivative coverage.
  5. LeanstralSafeVerify is a fork of SafeVerify; it may not generalise to Lean projects with custom build setups.
  6. PutnamBench is more crowded than it looks. Aleph Prover solves 668/672 at a higher per-problem cost, and some higher-ranked results rely on natural-language proof hints, which Leanstral 1.5 does not use.

What to watch

Sources