Create a domain-specific dataset and benchmark suite for testing LLM reasoning in the hard sciences (superconductivity and physics). Develop an agent that traverses graphs of academic papers to verify claims.
Suggested repo: science-bench
"Testing if your LLM is actually smart enough to do science."
Estimated effort: 40h
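
The claim-verification agent described above could start as a bounded breadth-first walk over a paper graph. The sketch below is only an illustration under stated assumptions: the `Paper` schema, the `find_supporting_papers` function, the toy `GRAPH`, and the keyword match standing in for a real claim check (for example an LLM call) are all hypothetical and not part of any existing science-bench code.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Paper:
    """A node in a toy paper graph (hypothetical schema)."""
    paper_id: str
    text: str
    cites: list[str] = field(default_factory=list)


def find_supporting_papers(graph, start_id, keywords, max_depth=2):
    """Breadth-first walk along citation edges, starting at start_id.

    Returns the IDs of visited papers whose text contains every keyword,
    in visit order. Keyword matching is a placeholder for the real
    claim-checking step, which this sketch does not implement.
    """
    wanted = [k.lower() for k in keywords]
    seen = {start_id}
    queue = deque([(start_id, 0)])
    hits = []
    while queue:
        paper_id, depth = queue.popleft()
        paper = graph[paper_id]
        text = paper.text.lower()
        if all(k in text for k in wanted):
            hits.append(paper_id)
        if depth == max_depth:
            continue  # do not follow citations past the depth limit
        for cited in paper.cites:
            if cited in graph and cited not in seen:
                seen.add(cited)
                queue.append((cited, depth + 1))
    return hits


# Toy data: invented placeholder papers, not real references.
GRAPH = {
    "A": Paper("A", "We report zero resistance below a critical temperature.", ["B", "C"]),
    "B": Paper("B", "Measurements show zero resistance and the Meissner effect.", ["D"]),
    "C": Paper("C", "A review of measurement techniques.", []),
    "D": Paper("D", "Zero resistance was not reproduced in this sample.", []),
}

if __name__ == "__main__":
    print(find_supporting_papers(GRAPH, "A", ["zero resistance"]))
```

Note that paper "D" matches the keywords even though it contradicts the claim. That is exactly the kind of case a naive matcher gets wrong, and it is where the benchmark would need to test actual reasoning rather than surface overlap.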