Add ClawBench to Benchmark section #144
ClawBench is a benchmark of 153 everyday tasks across 144 live production websites in 15 categories, designed to evaluate web agents end-to-end on real sites. It is safety-relevant for this list because it introduces a submission-interception layer that prevents agents from executing real side effects (checkouts, bookings, job applications) while still scoring end-state correctness: a practical safety mechanism for evaluating write-heavy agent tasks on production sites, complementing AgentDojo and AgentHarm.

Paper: https://arxiv.org/abs/2604.08523
Repo: https://github.com/reacher-z/ClawBench
License: MIT
CodeRabbit review: no actionable comments were generated in the recent review.
Walkthrough: a ClawBench (2026-04) entry was added to the "Benchmark" section of README.md, including paper and repository links and a `web-agent` tag; the existing list structure was not modified.
Estimated code review effort: 1 (Trivial), ~2 minutes. Pre-merge checks: 3 of 3 passed.
Small nudge here: let me know if you'd like the entry reformatted, rebased, or split into smaller commits. Happy to make whatever tweaks would help it land.
Adds ClawBench to the `## Benchmark` section.

**What it is**
ClawBench is a benchmark of 153 everyday tasks across 144 live production websites in 15 categories, evaluating LLM web agents end-to-end on real sites (e.g. shopping, bookings, forms, account actions).
Install: `uv tool install clawbench-eval`

**Why it fits this list**
ClawBench's core contribution is safety-relevant: it introduces a submission-interception layer that prevents agents from executing real side effects — checkouts, bookings, job applications, account mutations — while still scoring end-state correctness from captured submission payloads.
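The interceptor's implementation isn't part of this PR; as a minimal sketch of the general technique (an assumption about the mechanics, not ClawBench's actual code), a Playwright-driven harness can capture a submission payload at the network layer and short-circuit the real request:

```python
# Sketch only: capture write requests for scoring and answer them with a
# synthetic success response, so the live site never executes the side effect.
from playwright.sync_api import sync_playwright

captured_submissions = []  # payloads scored later for end-state correctness


def intercept(route):
    request = route.request
    if request.method == "POST":
        captured_submissions.append({"url": request.url, "payload": request.post_data})
        # Short-circuit: the request never reaches the real server.
        route.fulfill(status=200, content_type="application/json", body='{"ok": true}')
    else:
        route.continue_()  # reads, navigations, and assets pass through


with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # "**/checkout**" is a placeholder pattern; a real harness would match
    # each site's actual submission endpoints.
    page.route("**/checkout**", intercept)
    # ... the agent drives the page here ...
    browser.close()
```

Scoring then runs over `captured_submissions` rather than the live site's post-submission state, which is what keeps write-heavy tasks free of real side effects.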
This is, to our knowledge, the only approach that makes live-site agent evaluation safe for write-heavy tasks, directly addressing a practical safety problem in agentic AI evaluation. It complements the existing entries AgentDojo and AgentHarm, which focus on prompt-injection and harmfulness evaluation in controlled environments — ClawBench extends the agent-safety evaluation story to production websites.
**Format**

Entry is appended in chronological order, using the same `"Title", YYYY-MM, [[paper]] [[repo]]` format as surrounding entries and a `web-agent` tag consistent with existing tags in the Papers section; a mock-up of the resulting line is sketched below.

Disclosure: opened by a ClawBench contributor.