Daniel Shepard, Robin Salimans
Build an open-source evaluation suite for cross-application agents that tests API discovery and policy adherence. This fills a gap in enterprise-grade agent testing, which currently lacks standardized environments for multi-app workflows.
Suggested repo: auto-bench
"Stop testing agents with chatbots; start testing them with real business workflows."
Estimated effort: 80h