arXiv · 9h ago

ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

Xiangyi Li, Kyoung Whan Choe, Yimin Liu, Xiaokun Chen, Chujun Tao, Bingran You, Wenbo Chen, Zonglin Di, Jiankai Sun, Shenghan Zheng, Jiajun Bao, Yuanli Wang, Weixiang Yan, Yiyuan Li, Han-chung Lee

Analysis

Viral velocity: low

Implementation gap: YES

Novelty: 8/10

Category: paper

Topics: agents, evaluation, safety

Opportunity Brief

Release a plug-and-play evaluation suite that sets up stateful, multi-service productivity environments (email, calendar, CRM) for benchmarking agents safely. It bridges the gap between toy benchmarks and real-world utility.
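
To make the idea of a stateful, multi-service environment concrete, here is a minimal sketch in Python. It is not the ClawsBench API: the class `SimulatedWorkspace`, the function `score`, the `example.com` domain, and the two safety rules are all hypothetical, chosen only to illustrate how state can persist across agent actions and how capability and safety can be scored separately. Only email and calendar are modeled; a CRM service would presumably follow the same pattern.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedWorkspace:
    """Hypothetical in-memory workspace; an illustration, not the ClawsBench API."""
    internal_domain: str = "example.com"            # assumed org domain
    sent_emails: list = field(default_factory=list)
    calendar: dict = field(default_factory=dict)    # event_id -> title
    violations: list = field(default_factory=list)  # safety log

    def send_email(self, to: str, subject: str) -> None:
        # State persists across calls, so later checks can inspect it.
        self.sent_emails.append((to, subject))
        # Assumed safety rule: mail leaving the organization is flagged.
        if not to.endswith("@" + self.internal_domain):
            self.violations.append(f"external recipient: {to}")

    def create_event(self, event_id: str, title: str) -> None:
        self.calendar[event_id] = title

    def delete_event(self, event_id: str) -> None:
        # Assumed safety rule: deleting a nonexistent event is flagged.
        if event_id not in self.calendar:
            self.violations.append(f"deleted unknown event: {event_id}")
            return
        del self.calendar[event_id]


def score(ws: SimulatedWorkspace, expected_events: set) -> dict:
    """Report capability (goal state reached) and safety (no violations) separately."""
    return {
        "task_success": set(ws.calendar) == expected_events,
        "safe": not ws.violations,
    }


# Example episode: the agent completes the task but emails an outsider.
ws = SimulatedWorkspace()
ws.create_event("standup", "Daily standup")
ws.send_email("alice@example.com", "Invite: Daily standup")
ws.send_email("bob@other.org", "Invite: Daily standup")
result = score(ws, {"standup"})
```

In this episode `result` records a successful task alongside a safety violation, which is the kind of distinction a single pass/fail score would hide.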

Suggested repo: ClawsBenchmark

"Test your productivity agents in realistic, multi-service workspaces, without the risk."

Estimated effort: 70h