
Add ClawBench to Benchmark section #144

Open

reacher-z wants to merge 1 commit into corca-ai:main from reacher-z:add-clawbench-benchmark

Conversation


@reacher-z reacher-z commented Apr 15, 2026

Adds ClawBench to the ## Benchmark section.

What it is

ClawBench is a benchmark of 153 everyday tasks across 144 live production websites in 15 categories, evaluating LLM web agents end-to-end on real sites (e.g. shopping, bookings, forms, account actions).

Why it fits this list

ClawBench's core contribution is safety-relevant: it introduces a submission-interception layer that prevents agents from executing real side effects — checkouts, bookings, job applications, account mutations — while still scoring end-state correctness from captured submission payloads.

This is, to our knowledge, the only approach that makes live-site agent evaluation safe for write-heavy tasks, directly addressing a practical safety problem in agentic AI evaluation. It complements the existing entries AgentDojo and AgentHarm, which focus on prompt-injection and harmfulness evaluation in controlled environments — ClawBench extends the agent-safety evaluation story to production websites.
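For readers unfamiliar with the idea, here is a minimal sketch of what a submission-interception layer can look like, assuming a Playwright-driven browser harness; the handler, capture format, and mocked response are illustrative assumptions, not ClawBench's actual implementation.

```python
# Hypothetical sketch of submission interception (NOT ClawBench's code):
# write-type requests are captured for scoring and never reach the live site.
from playwright.sync_api import sync_playwright

captured_submissions = []  # later compared against the task's expected end state

def intercept(route, request):
    if request.method in ("POST", "PUT", "PATCH", "DELETE"):
        # Record the would-be side effect (checkout, booking, application, ...).
        captured_submissions.append({
            "url": request.url,
            "method": request.method,
            "payload": request.post_data,
        })
        # Return a fake success so the agent believes the submission went through,
        # while the production backend never receives the write.
        route.fulfill(status=200, body="{}", content_type="application/json")
    else:
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.route("**/*", intercept)
    # ... the agent drives the page here; any "Submit" action lands in
    # captured_submissions instead of mutating the real site ...
    browser.close()
```

Under this setup, scoring end-state correctness reduces to checking the captured payloads (items, dates, form fields) against the task specification, which is what makes write-heavy tasks safe to evaluate on production sites.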

Format

The entry is appended in chronological order, using the same "Title", YYYY-MM, [[paper]] [[repo]] format as the surrounding entries, with a web-agent tag consistent with the existing tags in the Papers section.

Disclosure: opened by a ClawBench contributor.

Summary by CodeRabbit

Release Notes

  • Documentation
    • A new entry has been added to the Benchmark section, covering ClawBench (2026-04) and its related resources.

ClawBench is a benchmark of 153 everyday tasks across 144 live
production websites in 15 categories, designed to evaluate web
agents end-to-end on real sites. It is safety-relevant for this
list because it introduces a submission-interception layer that
prevents agents from executing real side effects (checkouts,
bookings, job applications) while still scoring end-state
correctness — a practical safety mechanism for evaluating
write-heavy agent tasks on production sites, complementing
AgentDojo and AgentHarm.

Paper: https://arxiv.org/abs/2604.08523
Repo:  https://github.com/reacher-z/ClawBench
License: MIT

coderabbitai Bot commented Apr 15, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b29db175-af1a-448b-a514-34e7fe0e817b

📥 Commits

Reviewing files that changed from the base of the PR and between c8ae124 and 420b870.

📒 Files selected for processing (1)
  • README.md
📜 Recent review details
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-03-08T05:45:19.349Z
Learnt from: MaxwellCalkin
Repo: corca-ai/awesome-llm-security PR: 0
File: :0-0
Timestamp: 2026-03-08T05:45:19.349Z
Learning: The MaxwellCalkin/sentinel-ai repo uses compiled regex patterns (not ML inference) for LLM safety guardrails, achieving sub-millisecond latency. Its benchmark is fully self-curated (author designed both the regex patterns and the test cases), so 100% accuracy on that benchmark is expected by design. The ~20ms/~0.987 "Sentinel" figures are from Qualifire's unrelated "Sentinel" product, not this library.

Applied to files:

  • README.md
🔇 Additional comments (1)
README.md (1)

105-105: Good addition.

The entry follows the chronological ordering of the Benchmark section and matches the style of the existing entries.


Walkthrough

README.md의 "Benchmark" 섹션에 ClawBench (2026-04) 항목이 추가되었습니다. web-agent 태그와 함께 논문 및 저장소 링크가 포함되어 있으며, 기존 리스트 구조는 수정되지 않았습니다.

Changes

Cohort / File(s): README benchmark update (README.md)
Summary: Added a ClawBench (2026-04) entry to the Benchmark section, with a web-agent tag and related links.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks (3 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly and accurately summarizes the main change: adding ClawBench to the Benchmark section of the README.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; check skipped.


@reacher-z (Author)

Small nudge here — let me know if you'd like the entry reformatted, rebased, or split into smaller commits. Happy to make whatever tweaks would help it land.
