Leaderboard for the s2n-bignum-bench Neural Theorem Proving benchmark, evaluating low-level code reasoning of LLMs. 2,301 (evolving) HOL Light proof problems derived from AWS s2n-bignum cryptographic proofs.
To get your model featured on the leaderboard:
-
Set up the benchmark — follow the s2n-bignum-bench GUIDE to generate problems and evaluate your model's answers locally.
-
Prepare your answers CSV — a CSV with the columns
problem_id, category, query, answer:problem_id,category,query,answer arm.bignum_add.BIGNUM_ADD_CORRECT,functional_correctness_arm,"..."," REWRITE_TAC[...] THEN ..." -
Zip your CSV — compress the CSV into a
.zipfile before submitting. Raw CSV submissions are not accepted to prevent data contamination of answers in public issue threads. -
Submit via GitHub Issue — open a Leaderboard Submission issue with:
- Model name and URL (if applicable)
- Your name/org and profile link
- Whether the model is open or closed source
- Your answers
.zip(drag and drop as an attachment) - Any additional details (compute used, prompting strategy, etc.)
-
We evaluate and publish — we run your answers through the benchmark evaluation pipeline and add the results to the leaderboard.
Each problem receives one verdict:
| Verdict | Meaning |
|---|---|
| OK | Proof succeeded without adding axioms |
| FAIL | Tactic failed to prove the goal |
| CHEATING | Solution added axioms (detected via axiom count comparison) or CHEAT_TAC detected |
| ERROR | Runtime exception or compilation error |
| TIMEOUT | Exceeded category timeout (default: 120s - configurable, please let us know your time) |
| Category | Count | Description |
|---|---|---|
generic |
564 | General HOL Light lemmas |
program_state |
555 | Program state properties |
functional_correctness_arm |
439 | ARM functional correctness |
functional_correctness_x86 |
425 | x86-64 functional correctness |
bit_vector |
318 | Bit vector operations |
Balaji Rao, Juneyoung Lee (@aqjune)