Sihang Jiang, Lipeng Ma, Zhonghua Hong, Keyi Wang, Zhiyu Lu, Shisong Chen, Jinghao Zhang, Tianjun Pan, Weijia Zhou, Jiaqing Liang, Yanghua Xiao
Build a benchmark runner for 'Self-Evolving Agents'. Create a suite that measures how well agents can modify their own task strategy over long-duration runs without human resets.
Suggested repo: evolve-bench
"Is your agent learning, or just guessing?"
Estimated effort: 100h
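
One way to picture the core loop of such a runner is the minimal sketch below. It is not a design taken from the original work: the agent interface (`choose`, `update`), the fixed `payoffs` table standing in for a task, and the `improvement_score` metric (late-window mean reward minus early-window mean reward) are all hypothetical names and assumptions chosen for illustration. The point it tries to show is that the agent is never reset during the run, so any gain between the early and late windows has to come from the agent changing its own strategy.

```python
from statistics import mean


class ExploreThenExploitAgent:
    """Hypothetical agent: tries each strategy once, then keeps the best one."""

    def __init__(self, n_strategies):
        # Last reward observed for each strategy; None means "not tried yet".
        self.observed = [None] * n_strategies

    def choose(self):
        # Try any strategy that has not been evaluated yet.
        for i, reward in enumerate(self.observed):
            if reward is None:
                return i
        # Otherwise exploit the strategy with the highest observed reward.
        return max(range(len(self.observed)), key=lambda i: self.observed[i])

    def update(self, strategy, reward):
        self.observed[strategy] = reward


class FixedAgent:
    """Hypothetical baseline that never changes its strategy."""

    def choose(self):
        return 0

    def update(self, strategy, reward):
        pass  # ignores feedback on purpose


def run_long_episode(agent, payoffs, n_steps):
    """Run one continuous episode; the agent is never reset between steps.

    `payoffs[i]` is the (assumed deterministic) reward for strategy i.
    """
    rewards = []
    for _ in range(n_steps):
        strategy = agent.choose()
        reward = payoffs[strategy]
        agent.update(strategy, reward)
        rewards.append(reward)
    return rewards


def improvement_score(rewards, window):
    """Mean reward over the last `window` steps minus the first `window` steps."""
    if window <= 0 or 2 * window > len(rewards):
        raise ValueError("window must be positive and at most half the run length")
    return mean(rewards[-window:]) - mean(rewards[:window])


if __name__ == "__main__":
    payoffs = [0.2, 0.9, 0.5]  # assumed toy task, not real benchmark data
    for agent in (ExploreThenExploitAgent(len(payoffs)), FixedAgent()):
        rewards = run_long_episode(agent, payoffs, n_steps=12)
        print(type(agent).__name__, improvement_score(rewards, window=3))
```

In this toy setup, a score near zero suggests the agent's strategy did not change over the run, while a positive score suggests it adapted. A real suite would presumably need stochastic tasks, multiple seeds, and a statistical test before drawing that conclusion.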