by cahlen in-progress

Hardware

8× NVIDIA B200 (183 GB VRAM each, 1.43 TB total) 2× Intel Xeon Platinum 8570 (112 cores / 224 threads) 2 TB DDR5 RAM
theorem-provingaisearch-algorithms b200dgx mctsllm-provingtree-searchformal-verification

Key Results

MCTS vs. Naive Sampling for LLM Theorem Proving

Abstract

In our previous experiment (Zaremba verification), we discovered that SOTA theorem-proving LLMs nail proof structure but fail at witness search — producing syntactically perfect proofs with wrong values across 80 attempts. This experiment tests whether Monte Carlo Tree Search (MCTS) over the tactic space fixes this bottleneck. We compare naive i.i.d. sampling (the standard approach) against UCB1-guided tree search on the same proof obligations, measuring success rate vs. compute budget.

Background

The Witness Search Problem

When proving a,  P(a)\exists a,\; P(a) in Lean 4, the LLM must find a concrete witness aa satisfying PP. In our Zaremba experiment:

  • Both Goedel-32B and Kimina-72B always generated the correct proof template: exact ⟨WITNESS, by decide, by decide, by native_decide, by native_decide⟩
  • But for d=9d = 9, both models tried a=1a = 1 and a=3a = 3 across 80 samples without ever trying the correct a=2a = 2
  • The failure mode is not reasoning — it’s exploration

Why MCTS Should Help

Naive sampling draws NN independent proofs from the LLM. If the correct witness has probability pp per sample, success probability is 1(1p)N1 - (1-p)^N. But LLMs have mode collapse — they repeat the same high-probability witnesses and never explore the tail.

MCTS addresses this by:

  1. Tracking which tactics failed — the UCB1 score deprioritizes repeatedly-failing branches
  2. Feeding failures back to the LLM — the prompt includes previous failed attempts
  3. Systematic exploration — UCB1’s exploration term forces trying new branches

SOTA Context

  • AlphaProof (DeepMind, 2024): Used AlphaZero-style MCTS to solve 4 out of 6 IMO 2024 problems. The search component was critical.
  • Seed Prover 1.5 (ByteDance, 2025): Best-first tree search with DPO-trained value function. 99.6% on MiniF2F.
  • BFS-Prover-V2 (ByteDance, 2025): Step-level prover with planner enhancement. 95.08% on MiniF2F with search.

All SOTA systems use search. Nobody has published clean ablations comparing search vs. no-search on the same model and benchmark. That’s what we measure.

Method

Test Suite

Zaremba’s Conjecture for d=1,,50d = 1, \ldots, 50 with A=5A = 5. Each theorem requires finding a witness aa coprime to dd with bounded CF quotients. Difficulty varies: d=1d = 1 is trivial (a=1a = 1), d=49d = 49 requires finding a non-obvious witness.

Comparison

MethodDescriptionBudget
NaiveSample NN independent proofs, accept first that compilesNN evals
MCTSUCB1 tree search, 8 children per expansion, backpropagationNN evals

Both methods use the same LLM (Goedel-32B or Kimina-72B), the same total Lean evaluations, and the same prompts (modulo MCTS feeding back failures). The only difference is whether we search or sample.

We sweep N{8,32,64,128,256,512}N \in \{8, 32, 64, 128, 256, 512\} and measure:

  • Success rate: fraction of 50 theorems proved
  • First-proof eval: how many evals until the first proof is found
  • Unique tactics tried: how many distinct proof attempts were generated

Models

Served on split GPU sets via vLLM:

ModelGPUsPort
Goedel-Prover-V2-32B0–38000
Kimina-Prover-72B4–78001

Results

PENDING — experiment not yet run.

Success Rate vs. Budget

Budget NNNaive (Goedel)MCTS (Goedel)Naive (Kimina)MCTS (Kimina)
8PENDINGPENDINGPENDINGPENDING
32PENDINGPENDINGPENDINGPENDING
64PENDINGPENDINGPENDINGPENDING
128PENDINGPENDINGPENDINGPENDING
256PENDINGPENDINGPENDINGPENDING
512PENDINGPENDINGPENDINGPENDING

Exploration Efficiency

Budget NNUnique Tactics (Naive)Unique Tactics (MCTS)
64PENDINGPENDING
256PENDINGPENDING

Case Study: d=9d = 9

The theorem that defeated both models in naive mode (correct witness a=2a = 2):

MethodBudget to proveWitnesses tried before success
Naive (Goedel)PENDINGPENDING
MCTS (Goedel)PENDINGPENDING

Analysis

PENDING — will analyze:

  1. Does MCTS consistently outperform naive sampling, or only on “hard” witnesses?
  2. What’s the scaling law: how does success rate grow with budget for each method?
  3. Is the MCTS exploration term (UCB1) actually necessary, or does just feeding back failures suffice?
  4. Does MCTS help more for larger models or smaller models?

Reproducibility

git clone https://github.com/cahlen/idontknow
cd idontknow

# Start model servers (requires downloaded weights)
CUDA_VISIBLE_DEVICES=0,1,2,3 ./scripts/serve-model.sh models/Goedel-Prover-V2-32B --tp 4 --port 8000 &
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/serve-model.sh models/Kimina-Prover-72B --tp 4 --port 8001 &

# Run MCTS prover
python scripts/experiments/mcts-proof-search/mcts_prover.py \
    --server http://localhost:8000 \
    --file lean4-proving/conjectures/zaremba.lean \
    --budget 256 --rollouts 8

# Run naive prover (for comparison)
python lean4-proving/prover.py \
    --server http://localhost:8000 \
    --file lean4-proving/conjectures/zaremba.lean \
    --max-attempts 32 --zaremba

Computed on NVIDIA DGX B200. Code: github.com/cahlen/idontknow.