Mete Ismayilzada, Renqing Cuomao, Daniil Yurshevich, Anna Sotnikova, Lonneke van der Plas, Antoine Bosselut
Develop a benchmark runner for 'Creative Problem-Solving' that goes beyond simple brainteasers. This provides a standardized way to test if agents can actually synthesize new ideas.
Suggested repo: CresOWLve-Bench
"Test if your AI can actually solve new problems, or if it's just memorizing benchmarks."
Estimated effort: 40h
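
As a starting point, the core loop of such a runner could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Puzzle` schema, the `normalize` helper, the `run_benchmark` function, and the exact-match scoring rule are hypothetical and do not come from the paper or any existing repo. A real runner would likely need a more forgiving grader, since creative answers rarely match a reference string exactly.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Puzzle:
    """One benchmark item (hypothetical schema): a question and its accepted answers."""
    question: str
    answers: tuple[str, ...]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are not scored as errors."""
    return " ".join(text.lower().split())


def run_benchmark(agent: Callable[[str], str], puzzles: Iterable[Puzzle]) -> float:
    """Return the fraction of puzzles where the agent's output equals an accepted answer.

    Exact match after normalization is an assumed scoring rule, chosen only for simplicity.
    """
    total = 0
    correct = 0
    for puzzle in puzzles:
        total += 1
        prediction = normalize(agent(puzzle.question))
        accepted = {normalize(answer) for answer in puzzle.answers}
        if prediction in accepted:
            correct += 1
    # An empty puzzle set scores 0.0 instead of raising ZeroDivisionError.
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Placeholder items invented for this demo; they are not taken from the benchmark.
    demo_puzzles = [
        Puzzle("placeholder question 1", ("placeholder answer",)),
        Puzzle("placeholder question 2", ("another answer", "alt answer")),
    ]

    def echo_agent(question: str) -> str:
        """Stand-in for a model call; always returns the same string."""
        return "Placeholder Answer"

    print(f"accuracy: {run_benchmark(echo_agent, demo_puzzles):.2f}")
```

Passing the agent as a plain callable keeps the runner independent of any particular model API, so the same loop can score a local model, a hosted endpoint, or a scripted baseline.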