Create an open-source, local-first evaluation suite specifically for agentic research tasks. Developers should build a benchmark runner that audits the reasoning steps of local agents against a ground-truth dataset.