Commit 4a67e23

Add Claude Opus 4.5 to PR Benchmark
1 parent 3ce4780 commit 4a67e23

1 file changed: +29 -6 lines changed

docs/docs/pr_benchmark/index.md

Lines changed: 29 additions & 6 deletions
@@ -166,6 +166,12 @@ A list of the models used for generating the baseline suggestions, and example r
 <td style="text-align:left;"></td>
 <td style="text-align:center;"><b>32.4</b></td>
 </tr>
+<tr>
+<td style="text-align:left;">Claude-opus-4.5</td>
+<td style="text-align:left;">2025-11-01</td>
+<td style="text-align:left;">high</td>
+<td style="text-align:center;"><b>30.3</b></td>
+</tr>
 <tr>
 <td style="text-align:left;">GPT-4.1</td>
 <td style="text-align:left;">2025-04-14</td>
@@ -423,17 +429,34 @@ Final score: **32.8**
 
 Strengths:
 
-- **Focused and concise fixes:** When the model does detect a problem it usually proposes a minimal, well-scoped patch that compiles and directly addresses the defect without unnecessary noise.
-- **Good critical-bug instinct:** It often prioritises show-stoppers (compile failures, crashes, security issues) over cosmetic matters and occasionally spots subtle issues that all other reviewers miss.
-- **Clear explanations & snippets:** Explanations are short, readable and paired with ready-to-paste code, making the advice easy to apply.
+- **Focused and concise fixes:** When the model does detect a problem it usually proposes a minimal, well-scoped patch that compiles and directly addresses the defect without unnecessary noise.
+- **Good critical-bug instinct:** It often prioritises show-stoppers (compile failures, crashes, security issues) over cosmetic matters and occasionally spots subtle issues that all other reviewers miss.
+- **Clear explanations & snippets:** Explanations are short, readable and paired with ready-to-paste code, making the advice easy to apply.
 
 Weaknesses:
 
-- **High miss rate:** In a large fraction of examples the model returned an empty list or covered only one minor issue while overlooking more serious newly-introduced bugs.
-- **Inconsistent accuracy:** A noticeable subset of answers contain wrong or even harmful fixes (e.g., removing valid flags, creating compile errors, re-introducing bugs).
-- **Limited breadth:** Even when it finds a real defect it rarely reports additional related problems that peers catch, leading to partial reviews.
+- **High miss rate:** In a large fraction of examples the model returned an empty list or covered only one minor issue while overlooking more serious newly-introduced bugs.
+- **Inconsistent accuracy:** A noticeable subset of answers contain wrong or even harmful fixes (e.g., removing valid flags, creating compile errors, re-introducing bugs).
+- **Limited breadth:** Even when it finds a real defect it rarely reports additional related problems that peers catch, leading to partial reviews.
 - **Occasional guideline slips:** A few replies modify unchanged lines, suggest new imports, or duplicate suggestions, showing imperfect compliance with instructions.
 
+### Claude-Opus-4.5 (high thinking budget)
+
+Final score: **30.3**
+
+Strengths:
+
+- **High rule compliance & formatting:** Consistently produces valid YAML, respects the ≤3-suggestion limit, and usually confines edits to added lines, avoiding many guideline violations seen in peers.
+- **Low false-positive rate:** Tends to stay silent unless convinced of a real problem; when the diff is a pure version bump / docs tweak it often (correctly) returns an empty list, beating noisier baselines.
+- **Clear, focused patches when it fires:** In the minority of cases where it does spot a bug, it explains the issue crisply and supplies concise, copy-paste-able code snippets.
+
+Weaknesses:
+
+- **Very low recall:** In the vast majority of examples it misses obvious critical issues or suggests only a subset, frequently returning an empty list; this places it below most baselines on overall usefulness.
+- **Shallow coverage:** Even when it catches a defect it typically lists a single point and overlooks other high-impact problems present in the same diff.
+- **Occasional incorrect or incomplete fixes:** A non-trivial number of suggestions are wrong, compile-breaking, duplicate unchanged code, or touch out-of-scope lines, reducing trust.
+- **Inconsistent severity tagging & duplication:** Sometimes mis-labels critical vs general, repeats the same suggestion, or leaves `improved_code` blocks empty.
+
 ## Appendix - Example Results
 
 Some examples of benchmarked PRs and their results:
