arXiv · 17h ago

SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment

Sihang Jiang, Lipeng Ma, Zhonghua Hong, Keyi Wang, Zhiyu Lu, Shisong Chen, Jinghao Zhang, Tianjun Pan, Weijia Zhou, Jiaqing Liang, Yanghua Xiao

Analysis

Viral velocity: low

Implementation gap: No

Novelty: 8/10

Category: tool

Topics: agents, benchmark, evaluation

Opportunity Brief

Build a benchmark runner for self-evolving agents: a suite that measures how well an agent modifies its own task strategy over long-duration runs without human resets.

Suggested repo: evolve-bench

"Is your agent learning, or just guessing?"

Estimated effort: 100h
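
One possible starting point for the runner described in the brief is sketched below. It is not the SEA-Eval protocol: every name here (`Agent`, `run_stream`, `RunResult`, `improvement`, the toy agents) is hypothetical, and the metric (late-window success rate minus early-window success rate) is only an assumed stand-in for "is the agent improving". The key idea it illustrates is that the agent instance is never reset between tasks, so any gain has to come from state the agent carries forward itself.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Protocol, Sequence, Tuple


class Agent(Protocol):
    """Hypothetical agent interface assumed for this sketch."""

    def act(self, task: str) -> str: ...
    def observe(self, task: str, answer: str, success: bool) -> None: ...


@dataclass
class RunResult:
    outcomes: List[bool] = field(default_factory=list)

    def success_rate(self, start: int, end: int) -> float:
        window = self.outcomes[start:end]
        return sum(window) / len(window) if window else 0.0

    def improvement(self, window: int) -> float:
        """Late-window success rate minus early-window success rate."""
        n = len(self.outcomes)
        window = min(window, n)
        if window == 0:
            return 0.0
        return self.success_rate(n - window, n) - self.success_rate(0, window)


def run_stream(
    agent: Agent,
    tasks: Sequence[Tuple[str, str]],
    check: Callable[[str, str], bool],
) -> RunResult:
    """Run one agent instance over the whole task stream with no resets."""
    result = RunResult()
    for task, expected in tasks:
        answer = agent.act(task)
        ok = check(answer, expected)
        agent.observe(task, answer, ok)  # feedback only; no human intervention
        result.outcomes.append(ok)
    return result


class StaticAgent:
    """Toy baseline: never changes its strategy."""

    def act(self, task: str) -> str:
        return task.lower()

    def observe(self, task: str, answer: str, success: bool) -> None:
        pass


class AdaptiveAgent:
    """Toy agent: switches to the next strategy after each failure."""

    def __init__(self) -> None:
        self.strategies = [str.lower, str.title, str.upper]
        self.index = 0

    def act(self, task: str) -> str:
        return self.strategies[self.index](task)

    def observe(self, task: str, answer: str, success: bool) -> None:
        if not success:
            self.index = (self.index + 1) % len(self.strategies)


def demo() -> Tuple[RunResult, RunResult]:
    words = ["alpha", "bravo", "charlie", "delta", "echo",
             "foxtrot", "golf", "hotel", "india", "juliet"]
    tasks = [(w, w.upper()) for w in words]  # toy task: upper-case the word
    exact = lambda answer, expected: answer == expected
    return (run_stream(StaticAgent(), tasks, exact),
            run_stream(AdaptiveAgent(), tasks, exact))


if __name__ == "__main__":
    static, adaptive = demo()
    print("static improvement:", static.improvement(window=3))
    print("adaptive improvement:", adaptive.improvement(window=3))
```

In this toy setup the static agent shows no improvement, while the adaptive agent fails its first attempts and then succeeds once it has switched strategy. A real suite would replace the toy tasks with long task streams and would likely need more robust metrics than a two-window difference.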