Xinyu Jessica Wang, Haoyue Bai, Yiyou Sun, Haorui Wang, Shuibai Zhang, Wenjie Hu, Mya Schroder, Bilge Mutlu, Dawn Song, Robert D Nowak
Create an open-source evaluation suite for long-horizon agent tasks to benchmark existing frameworks against the new HORIZON dataset.
Suggested repo: horizon-eval
"Is your agent actually smart or just lucky? Stress test its long-horizon planning."
Estimated effort: 20h