Xiangyi Li, Kyoung Whan Choe, Yimin Liu, Xiaokun Chen, Chujun Tao, Bingran You, Wenbo Chen, Zonglin Di, Jiankai Sun, Shenghan Zheng, Jiajun Bao, Yuanli Wang, Weixiang Yan, Yiyuan Li, Han-chung Lee
Release a plug-and-play evaluation suite that sets up stateful, multi-service productivity environments (email, calendar, CRM) for benchmarking agents safely. It bridges the gap between toy benchmarks and real-world utility.
Suggested repo: ClawsBenchmark
"Test your productivity agents in realistic, multi-service workspaces—without the risk."
Estimated effort: 70h
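As a rough illustration of what "stateful, multi-service" could mean here, the sketch below keeps email, calendar, and CRM state in a single in-memory workspace and scores a task by inspecting the final state. Everything in it (`Workspace`, `score`, `scripted_agent`, the sample task) is a hypothetical name chosen for this example, not part of any existing suite, and a real agent would replace the scripted one.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """In-memory stand-in for a multi-service workspace (hypothetical design)."""
    inbox: list = field(default_factory=list)     # email service state
    events: list = field(default_factory=list)    # calendar service state
    contacts: dict = field(default_factory=dict)  # CRM service state

    def send_email(self, to: str, subject: str) -> None:
        self.inbox.append({"to": to, "subject": subject})

    def add_event(self, title: str, day: str) -> None:
        self.events.append({"title": title, "day": day})

    def update_contact(self, name: str, status: str) -> None:
        self.contacts[name] = status


def score(ws: Workspace) -> bool:
    """Sample task check: Ada was emailed, a meeting exists, and the CRM reflects it."""
    emailed = any(m["to"] == "ada@example.com" for m in ws.inbox)
    scheduled = any(e["title"] == "Meet Ada" for e in ws.events)
    return emailed and scheduled and ws.contacts.get("Ada") == "meeting booked"


def scripted_agent(ws: Workspace) -> None:
    """Placeholder for a real agent: performs the three actions directly."""
    ws.add_event("Meet Ada", "2025-01-15")
    ws.send_email("ada@example.com", "Meeting invite")
    ws.update_contact("Ada", "meeting booked")


fresh = Workspace()            # each run starts from a clean, isolated state
before = score(fresh)          # task is not yet satisfied
scripted_agent(fresh)
after = score(fresh)           # all three services now hold the expected state
print(before, after)
```

Because the state lives entirely in memory, nothing the agent does reaches a real mailbox or calendar, which is the "without the risk" part of the pitch; scoring on final state rather than on the action trace is one possible design choice among several.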