CodeClash is a benchmark for evaluating AI systems on **goal-oriented software engineering**.

Today's AI coding evals are *task*-oriented (e.g., <a href="https://github.com/openai/human-eval">HumanEval</a>, <a href="https://swebench.com">SWE-bench</a>).
Models are given explicit instructions.
We then verify correctness with unit tests.
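
To make the contrast concrete, here is a minimal sketch of task-oriented evaluation (illustrative only: `model_generated_add` is a hypothetical stand-in for model-written code, not part of any benchmark's API):

```python
# Task-oriented evaluation in miniature: the model receives an explicit
# instruction, and correctness is checked with unit tests.
# (Hypothetical sketch; `model_generated_add` stands in for code a model
# would produce. It is not part of HumanEval or SWE-bench.)

def model_generated_add(a: int, b: int) -> int:
    """Hypothetical model output for the instruction: 'add two integers'."""
    return a + b

def test_add() -> None:
    assert model_generated_add(2, 3) == 5
    assert model_generated_add(-1, 1) == 0

if __name__ == "__main__":
    test_add()
    print("Unit tests passed: the task counts as solved.")
```
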
But building software is fundamentally driven by goals ("improve user retention", "reduce costs", "increase revenue").
Reaching our goals via code is a self-directed, iterative, and often competitive process.
To capture this dynamism of real software development, we introduce CodeClash!

Check out our [arXiv paper](https://arxiv.org/abs/2511.00839) and [website](https://codeclash.ai/) for the full details!
```bash
$ pip install -e '.[dev]'
$ python main.py configs/test/battlesnake.yaml
```

> [!TIP]
> CodeClash requires Docker to create execution environments. It was developed and tested on Ubuntu 22.04.4 LTS.
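
Before your first tournament, you can sanity-check that Docker is reachable with a quick snippet like the one below (illustrative only, not a CodeClash command):

```python
import shutil
import subprocess

# Illustrative pre-flight check, not part of CodeClash: confirm the
# `docker` CLI is installed and that the daemon responds to `docker info`.
if shutil.which("docker") is None:
    print("Docker CLI not found; install Docker first.")
else:
    result = subprocess.run(["docker", "info"], capture_output=True)
    if result.returncode == 0:
        print("Docker is ready.")
    else:
        print("Docker daemon not reachable; is it running?")
```
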
Once the test command above works, you should be set up to run a real tournament!
To run *Claude Sonnet 4.5* against *o3* in a *BattleSnake* tournament with *5 rounds* and *1000 competition simulations* per round, run:
```bash