arXiv (2d ago)

The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

Xinyu Jessica Wang, Haoyue Bai, Yiyou Sun, Haorui Wang, Shuibai Zhang, Wenjie Hu, Mya Schroder, Bilge Mutlu, Dawn Song, Robert D Nowak

Analysis

Viral velocity: low

Implementation gap: YES

Novelty: 6/10

Category: paper

Topics: agents, reasoning, benchmark

Create an open-source evaluation suite for long-horizon agent tasks to benchmark existing frameworks against the new HORIZON dataset.

Suggested repo: horizon-eval

"Is your agent actually smart or just lucky? Stress test its long-horizon planning."

Estimated effort: 20h