s2n-bignum-bench Leaderboard

Leaderboard for the s2n-bignum-bench Neural Theorem Proving benchmark, evaluating low-level code reasoning of LLMs. 2,301 (evolving) HOL Light proof problems derived from AWS s2n-bignum cryptographic proofs.

Viewing the Leaderboard

Leaderboard Webpage

Submitting to the Leaderboard

To get your model featured on the leaderboard:

Set up the benchmark — follow the s2n-bignum-bench GUIDE to generate problems and evaluate your model's answers locally.

Prepare your answers CSV — a CSV with the columns problem_id, category, query, answer:

problem_id,category,query,answer
arm.bignum_add.BIGNUM_ADD_CORRECT,functional_correctness_arm,"..."," REWRITE_TAC[...] THEN ..."

Zip your CSV — compress the CSV into a .zip file before submitting. Raw CSV submissions are not accepted to prevent data contamination of answers in public issue threads.
Submit via GitHub Issue — open a Leaderboard Submission issue with:
- Model name and URL (if applicable)
- Your name/org and profile link
- Whether the model is open or closed source
- Your answers .zip (drag and drop as an attachment)
- Any additional details (compute used, prompting strategy, etc.)
We evaluate and publish — we run your answers through the benchmark evaluation pipeline and add the results to the leaderboard.

Verdicts

Each problem receives one verdict:

Verdict	Meaning
OK	Proof succeeded without adding axioms
FAIL	Tactic failed to prove the goal
CHEATING	Solution added axioms (detected via axiom count comparison) or CHEAT_TAC detected
ERROR	Runtime exception or compilation error
TIMEOUT	Exceeded category timeout (default: 120s - configurable, please let us know your time)

Problem Categories

Category	Count	Description
`generic`	564	General HOL Light lemmas
`program_state`	555	Program state properties
`functional_correctness_arm`	439	ARM functional correctness
`functional_correctness_x86`	425	x86-64 functional correctness
`bit_vector`	318	Bit vector operations

Maintainers

Balaji Rao, Juneyoung Lee (@aqjune)

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
.github/ISSUE_TEMPLATE		.github/ISSUE_TEMPLATE
README.md		README.md
index.html		index.html

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

s2n-bignum-bench Leaderboard

Viewing the Leaderboard

Submitting to the Leaderboard

Verdicts

Problem Categories

Maintainers

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

s2n-bignum-bench Leaderboard

Viewing the Leaderboard

Submitting to the Leaderboard

Verdicts

Problem Categories

Maintainers

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages