Create an open-source, local-first evaluation suite for agentic research tasks: a benchmark runner that audits the reasoning steps and citations of local agents against a ground-truth dataset.
Suggested repo: research-eval-kit
"Verify your local research agent's performance with verifiable citations and audit logs."
Estimated effort: 30h
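
A minimal sketch of what the core runner could look like, assuming a JSONL ground-truth format with `question` and `citations` fields. The names `GroundTruthCase`, `AuditRecord`, `load_cases`, and `run_benchmark` are hypothetical, and the stub agent stands in for a real local research agent:

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

@dataclass
class GroundTruthCase:
    """One benchmark case: a research question plus the citations a correct answer must include."""
    question: str
    expected_citations: set[str]

@dataclass
class AuditRecord:
    """Audit-log entry for a single case: what the agent cited vs. what was expected."""
    question: str
    cited: set[str]
    missing: set[str]
    spurious: set[str]

    @property
    def passed(self) -> bool:
        # A case passes when no expected citation is missing.
        return not self.missing

def load_cases(path: Path) -> list[GroundTruthCase]:
    """Load ground-truth cases from a JSONL file (assumed 'question' and 'citations' keys)."""
    cases = []
    with path.open() as f:
        for line in f:
            row = json.loads(line)
            cases.append(GroundTruthCase(row["question"], set(row["citations"])))
    return cases

def run_benchmark(agent: Callable[[str], set[str]],
                  cases: list[GroundTruthCase]) -> list[AuditRecord]:
    """Run the agent on every case and audit its citations against ground truth."""
    records = []
    for case in cases:
        cited = agent(case.question)
        records.append(AuditRecord(
            question=case.question,
            cited=cited,
            missing=case.expected_citations - cited,   # expected but not produced
            spurious=cited - case.expected_citations,  # produced but not expected
        ))
    return records

if __name__ == "__main__":
    # Stub agent standing in for a real local model behind the same interface.
    def stub_agent(question: str) -> set[str]:
        return {"doi:10.1000/example"}

    cases = [GroundTruthCase("What year was X published?", {"doi:10.1000/example"})]
    for rec in run_benchmark(stub_agent, cases):
        status = "PASS" if rec.passed else "FAIL"
        print(f"[{status}] {rec.question} "
              f"missing={sorted(rec.missing)} spurious={sorted(rec.spurious)}")
```

Using set differences makes the audit explicit in both directions (missing vs. spurious citations), and each `AuditRecord` doubles as a serializable audit-log entry for the promised logs.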