jodah
View original ↗Develop an evaluation framework for testing the limits of multi-step agentic reasoning at scale. This tool should focus on measuring consistency and error propagation in long-chain operations.
Suggested repo: leap-eval
"Testing the breaking points of your agentic reasoning loops."
Estimated effort: 40h