MCTS vs. Naive Sampling for LLM Theorem Proving
Abstract
In our previous experiment (Zaremba verification), we discovered that SOTA theorem-proving LLMs nail proof structure but fail at witness search — producing syntactically perfect proofs with wrong values across 80 attempts. This experiment tests whether Monte Carlo Tree Search (MCTS) over the tactic space fixes this bottleneck. We compare naive i.i.d. sampling (the standard approach) against UCB1-guided tree search on the same proof obligations, measuring success rate vs. compute budget.
Background
The Witness Search Problem
When proving in Lean 4, the LLM must find a concrete witness satisfying . In our Zaremba experiment:
- Both Goedel-32B and Kimina-72B always generated the correct proof template:
exact ⟨WITNESS, by decide, by decide, by native_decide, by native_decide⟩ - But for , both models tried and across 80 samples without ever trying the correct
- The failure mode is not reasoning — it’s exploration
Why MCTS Should Help
Naive sampling draws independent proofs from the LLM. If the correct witness has probability per sample, success probability is . But LLMs have mode collapse — they repeat the same high-probability witnesses and never explore the tail.
MCTS addresses this by:
- Tracking which tactics failed — the UCB1 score deprioritizes repeatedly-failing branches
- Feeding failures back to the LLM — the prompt includes previous failed attempts
- Systematic exploration — UCB1’s exploration term forces trying new branches
SOTA Context
- AlphaProof (DeepMind, 2024): Used AlphaZero-style MCTS to solve 4 out of 6 IMO 2024 problems. The search component was critical.
- Seed Prover 1.5 (ByteDance, 2025): Best-first tree search with DPO-trained value function. 99.6% on MiniF2F.
- BFS-Prover-V2 (ByteDance, 2025): Step-level prover with planner enhancement. 95.08% on MiniF2F with search.
All SOTA systems use search. Nobody has published clean ablations comparing search vs. no-search on the same model and benchmark. That’s what we measure.
Method
Test Suite
Zaremba’s Conjecture for with . Each theorem requires finding a witness coprime to with bounded CF quotients. Difficulty varies: is trivial (), requires finding a non-obvious witness.
Comparison
| Method | Description | Budget |
|---|---|---|
| Naive | Sample independent proofs, accept first that compiles | evals |
| MCTS | UCB1 tree search, 8 children per expansion, backpropagation | evals |
Both methods use the same LLM (Goedel-32B or Kimina-72B), the same total Lean evaluations, and the same prompts (modulo MCTS feeding back failures). The only difference is whether we search or sample.
We sweep and measure:
- Success rate: fraction of 50 theorems proved
- First-proof eval: how many evals until the first proof is found
- Unique tactics tried: how many distinct proof attempts were generated
Models
Served on split GPU sets via vLLM:
| Model | GPUs | Port |
|---|---|---|
| Goedel-Prover-V2-32B | 0–3 | 8000 |
| Kimina-Prover-72B | 4–7 | 8001 |
Results
PENDING — experiment not yet run.
Success Rate vs. Budget
| Budget | Naive (Goedel) | MCTS (Goedel) | Naive (Kimina) | MCTS (Kimina) |
|---|---|---|---|---|
| 8 | PENDING | PENDING | PENDING | PENDING |
| 32 | PENDING | PENDING | PENDING | PENDING |
| 64 | PENDING | PENDING | PENDING | PENDING |
| 128 | PENDING | PENDING | PENDING | PENDING |
| 256 | PENDING | PENDING | PENDING | PENDING |
| 512 | PENDING | PENDING | PENDING | PENDING |
Exploration Efficiency
| Budget | Unique Tactics (Naive) | Unique Tactics (MCTS) |
|---|---|---|
| 64 | PENDING | PENDING |
| 256 | PENDING | PENDING |
Case Study:
The theorem that defeated both models in naive mode (correct witness ):
| Method | Budget to prove | Witnesses tried before success |
|---|---|---|
| Naive (Goedel) | PENDING | PENDING |
| MCTS (Goedel) | PENDING | PENDING |
Analysis
PENDING — will analyze:
- Does MCTS consistently outperform naive sampling, or only on “hard” witnesses?
- What’s the scaling law: how does success rate grow with budget for each method?
- Is the MCTS exploration term (UCB1) actually necessary, or does just feeding back failures suffice?
- Does MCTS help more for larger models or smaller models?
Reproducibility
git clone https://github.com/cahlen/idontknow
cd idontknow
# Start model servers (requires downloaded weights)
CUDA_VISIBLE_DEVICES=0,1,2,3 ./scripts/serve-model.sh models/Goedel-Prover-V2-32B --tp 4 --port 8000 &
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/serve-model.sh models/Kimina-Prover-72B --tp 4 --port 8001 &
# Run MCTS prover
python scripts/experiments/mcts-proof-search/mcts_prover.py \
--server http://localhost:8000 \
--file lean4-proving/conjectures/zaremba.lean \
--budget 256 --rollouts 8
# Run naive prover (for comparison)
python lean4-proving/prover.py \
--server http://localhost:8000 \
--file lean4-proving/conjectures/zaremba.lean \
--max-attempts 32 --zaremba
Computed on NVIDIA DGX B200. Code: github.com/cahlen/idontknow.