docs/docs/pr_benchmark/index.md
Lines changed: 29 additions & 6 deletions
@@ -166,6 +166,12 @@ A list of the models used for generating the baseline suggestions, and example r
 <td style="text-align:left;"></td>
 <td style="text-align:center;"><b>32.4</b></td>
 </tr>
+<tr>
+<td style="text-align:left;">Claude-opus-4.5</td>
+<td style="text-align:left;">2025-11-01</td>
+<td style="text-align:left;">high</td>
+<td style="text-align:center;"><b>30.3</b></td>
+</tr>
 <tr>
 <td style="text-align:left;">GPT-4.1</td>
 <td style="text-align:left;">2025-04-14</td>
@@ -423,17 +429,34 @@ Final score: **32.8**
 Strengths:

-- **Focused and concise fixes:** When the model does detect a problem it usually proposes a minimal, well-scoped patch that compiles and directly addresses the defect without unnecessary noise.
-- **Good critical-bug instinct:** It often prioritises show-stoppers (compile failures, crashes, security issues) over cosmetic matters and occasionally spots subtle issues that all other reviewers miss.
-- **Clear explanations & snippets:** Explanations are short, readable and paired with ready-to-paste code, making the advice easy to apply.
+- **Focused and concise fixes:** When the model does detect a problem it usually proposes a minimal, well-scoped patch that compiles and directly addresses the defect without unnecessary noise.
+- **Good critical-bug instinct:** It often prioritises show-stoppers (compile failures, crashes, security issues) over cosmetic matters and occasionally spots subtle issues that all other reviewers miss.
+- **Clear explanations & snippets:** Explanations are short, readable and paired with ready-to-paste code, making the advice easy to apply.

 Weaknesses:

-- **High miss rate:** In a large fraction of examples the model returned an empty list or covered only one minor issue while overlooking more serious newly-introduced bugs.
-- **Inconsistent accuracy:** A noticeable subset of answers contain wrong or even harmful fixes (e.g., removing valid flags, creating compile errors, re-introducing bugs).
-- **Limited breadth:** Even when it finds a real defect it rarely reports additional related problems that peers catch, leading to partial reviews.
+- **High miss rate:** In a large fraction of examples the model returned an empty list or covered only one minor issue while overlooking more serious newly-introduced bugs.
+- **Inconsistent accuracy:** A noticeable subset of answers contain wrong or even harmful fixes (e.g., removing valid flags, creating compile errors, re-introducing bugs).
+- **Limited breadth:** Even when it finds a real defect it rarely reports additional related problems that peers catch, leading to partial reviews.
 - **Occasional guideline slips:** A few replies modify unchanged lines, suggest new imports, or duplicate suggestions, showing imperfect compliance with instructions.

+### Claude-Opus-4.5 (high thinking budget)
+
+Final score: **30.3**
+
+Strengths:
+
+- **High rule compliance & formatting:** Consistently produces valid YAML, respects the ≤3-suggestion limit, and usually confines edits to added lines, avoiding many guideline violations seen in peers.
+- **Low false-positive rate:** Tends to stay silent unless convinced of a real problem; when the diff is a pure version bump / docs tweak it often (correctly) returns an empty list, beating noisier baselines.
+- **Clear, focused patches when it fires:** In the minority of cases where it does spot a bug, it explains the issue crisply and supplies concise, copy-paste-able code snippets.
+
+Weaknesses:
+
+- **Very low recall:** In the vast majority of examples it misses obvious critical issues or suggests only a subset, frequently returning an empty list; this places it below most baselines on overall usefulness.
+- **Shallow coverage:** Even when it catches a defect it typically lists a single point and overlooks other high-impact problems present in the same diff.
+- **Occasional incorrect or incomplete fixes:** A non-trivial number of suggestions are wrong, compile-breaking, duplicate unchanged code, or touch out-of-scope lines, reducing trust.
+- **Inconsistent severity tagging & duplication:** Sometimes mis-labels critical vs general, repeats the same suggestion, or leaves `improved_code` blocks empty.
+
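Reviewer note: several of the compliance points named in the Claude-Opus-4.5 bullets (the ≤3-suggestion limit, empty `improved_code` blocks, repeated suggestions) are mechanical enough to check automatically. The following is only a minimal sketch of such a check, not the benchmark's actual scoring code. It assumes each reply has already been parsed from YAML into a list of dictionaries; the function name `compliance_issues` and the `suggestion_content` field are hypothetical, and only `improved_code` is taken from the text above.

```python
from typing import Dict, List

MAX_SUGGESTIONS = 3  # the "<=3-suggestion limit" mentioned in the bullets


def compliance_issues(suggestions: List[Dict[str, str]]) -> List[str]:
    """Return human-readable rule violations for one parsed model reply.

    Assumption: each suggestion is a dict with the hypothetical key
    `suggestion_content` and the key `improved_code`.
    """
    issues: List[str] = []

    if len(suggestions) > MAX_SUGGESTIONS:
        issues.append(
            f"too many suggestions: {len(suggestions)} > {MAX_SUGGESTIONS}"
        )

    seen = set()
    for index, suggestion in enumerate(suggestions):
        # An empty `improved_code` block is one of the listed weaknesses.
        if not suggestion.get("improved_code", "").strip():
            issues.append(f"suggestion {index}: empty improved_code")

        # Suggestions whose text matches after trimming whitespace and
        # lowercasing are treated as duplicates (a deliberately crude rule).
        key = suggestion.get("suggestion_content", "").strip().lower()
        if key and key in seen:
            issues.append(
                f"suggestion {index}: duplicate of an earlier suggestion"
            )
        seen.add(key)

    return issues


# Hypothetical reply with one empty code block and one repeated suggestion.
reply = [
    {
        "suggestion_content": "Guard against a None config",
        "improved_code": "if cfg is None:\n    return",
    },
    {
        "suggestion_content": "guard against a None config ",
        "improved_code": "",
    },
]
print(compliance_issues(reply))
```

An empty reply passes this check, which matches the observation above that an empty list can be the correct answer; the check says nothing about recall, so it would complement a judged score rather than replace it.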
 ## Appendix - Example Results

Some examples of benchmarked PRs and their results: