Add ClawBench to Benchmark section #144
ClawBench is a benchmark of 153 everyday tasks across 144 live production websites in 15 categories, designed to evaluate web agents end-to-end on real sites. It is safety-relevant for this list because it introduces a submission-interception layer that prevents agents from executing real side effects (checkouts, bookings, job applications) while still scoring end-state correctness: a practical safety mechanism for evaluating write-heavy agent tasks on production sites, complementing AgentDojo and AgentHarm.

Paper: https://arxiv.org/abs/2604.08523
Repo: https://github.com/reacher-z/ClawBench
License: MIT
CodeRabbit review: no actionable comments were generated in the recent review.
Walkthrough: a ClawBench (2026-04) entry was added to the "Benchmark" section of README.md, including paper and repository links and a `web-agent` tag; the existing list structure was not modified.
Estimated code review effort: 1 (Trivial), ~2 minutes. Pre-merge checks: 3 of 3 passed.
Small nudge here: let me know if you'd like the entry reformatted, rebased, or split into smaller commits. Happy to make whatever tweaks would help it land.
Adds ClawBench to the `## Benchmark` section.

**What it is**
ClawBench is a benchmark of 153 everyday tasks across 144 live production websites in 15 categories, evaluating LLM web agents end-to-end on real sites (e.g. shopping, bookings, forms, account actions).
Install: `uv tool install clawbench-eval`

**Why it fits this list**
ClawBench's core contribution is safety-relevant: it introduces a submission-interception layer that prevents agents from executing real side effects — checkouts, bookings, job applications, account mutations — while still scoring end-state correctness from captured submission payloads.
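The interceptor's implementation isn't part of this PR; as a minimal sketch of the general technique (an assumption about the mechanics, not ClawBench's actual code), a Playwright-driven harness can capture a submission payload at the network layer and short-circuit the real request:

```python
# Sketch only: capture write requests for scoring and answer them with a
# synthetic success response, so the live site never executes the side effect.
from playwright.sync_api import sync_playwright

captured_submissions = []  # payloads scored later for end-state correctness


def intercept(route):
    request = route.request
    if request.method == "POST":
        captured_submissions.append({"url": request.url, "payload": request.post_data})
        # Short-circuit: the request never reaches the real server.
        route.fulfill(status=200, content_type="application/json", body='{"ok": true}')
    else:
        route.continue_()  # reads, navigations, and assets pass through


with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # "**/checkout**" is a placeholder pattern; a real harness would match
    # each site's actual submission endpoints.
    page.route("**/checkout**", intercept)
    # ... the agent drives the page here ...
    browser.close()
```

Scoring then runs over `captured_submissions` rather than the live site's post-submission state, which is what keeps write-heavy tasks free of real side effects.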
This is, to our knowledge, the only approach that makes live-site agent evaluation safe for write-heavy tasks, directly addressing a practical safety problem in agentic AI evaluation. It complements the existing entries AgentDojo and AgentHarm, which focus on prompt-injection and harmfulness evaluation in controlled environments — ClawBench extends the agent-safety evaluation story to production websites.
**Format**

Entry is appended in chronological order, using the same `"Title", YYYY-MM, [[paper]] [[repo]]` format as surrounding entries and a `web-agent` tag consistent with existing tags in the Papers section; a mock-up of the resulting line is sketched below.

Disclosure: opened by a ClawBench contributor.