Xiang Long, Li Du, Yilong Xu, Fangcheng Liu, Haoqing Wang, Ning Ding, Ziheng Li, Jianyuan Guo, Yehui Tang
Build a CLI evaluation harness that automates complex, multi-step agent interactions using this benchmark schema. Developers need a standard way to stress-test their agents in realistic, messy environments.
Suggested repo: clawbench
"Stop testing agents in a vacuum; benchmark them in the wild."
Estimated effort: 40h