Vedant Jawandhia, Yash Sinha, Murari Mandal, Ankan Pal, Dhruv Kumar
Develop a robustness benchmark tool that tests whether an LLM solves a math problem regardless of how that problem is phrased. This will expose fragility in current reasoning benchmarks.
Suggested repo: georep-eval
"Is your math model reasoning, or just pattern matching?"
Estimated effort: 35h
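The core loop of such a benchmark could be sketched as follows. This is a minimal illustration, not the project's actual design: the paraphrase list, the `robustness_score` helper, and the `brittle_model` stub are all hypothetical, and a real harness would call an actual LLM and normalize its answers.

```python
from typing import Callable

# Hypothetical paraphrases of one underlying problem; a real benchmark
# would generate or curate many such variant sets.
PARAPHRASES = [
    "What is 12 multiplied by 7?",
    "A box holds 12 items. How many items are in 7 boxes?",
    "Compute the product of twelve and seven.",
]

def robustness_score(model: Callable[[str], str],
                     variants: list[str],
                     expected: str) -> float:
    """Fraction of semantically identical phrasings answered correctly.

    A score below 1.0 suggests the model is pattern matching on surface
    form rather than reasoning about the underlying problem.
    """
    correct = sum(1 for v in variants if model(v).strip() == expected)
    return correct / len(variants)

# Toy stand-in for an LLM call: only recognizes one surface form.
def brittle_model(prompt: str) -> str:
    return "84" if "multiplied" in prompt else "unsure"

print(robustness_score(brittle_model, PARAPHRASES, "84"))  # → 0.3333333333333333
```

A fully robust model would score 1.0 on every variant set; the gap between per-variant accuracy and whole-set consistency is the fragility signal the benchmark would report.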