Develop a robustness benchmark tool that tests whether an LLM solves a math problem correctly regardless of how the problem is expressed. The goal is to expose fragility in current reasoning benchmarks.
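
A minimal sketch of such a tool might look like the following. It is an illustration under stated assumptions, not a definitive design: the names `Problem`, `extract_number`, `evaluate`, and `toy_model` are hypothetical, the model is assumed to be any callable that maps a prompt string to a reply string, answers are assumed to be numeric, and the differently worded variants are assumed to be supplied by hand rather than generated. It reports average accuracy over all variants, "robust" accuracy (a problem counts only if every variant is solved), and the gap between the two as a rough fragility signal.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Problem:
    variants: List[str]  # the same problem, worded in different ways
    answer: float        # ground-truth numeric answer


def extract_number(reply: str) -> Optional[float]:
    """Return the last number in the model's reply, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", reply)
    return float(matches[-1]) if matches else None


def evaluate(
    problems: List[Problem],
    model: Callable[[str], str],
    tol: float = 1e-6,
) -> Dict[str, float]:
    """Score a model on every variant of every problem."""
    if not problems or any(not p.variants for p in problems):
        raise ValueError("need at least one problem, each with at least one variant")
    variant_hits = 0
    variant_total = 0
    robust_hits = 0
    for problem in problems:
        results = []
        for prompt in problem.variants:
            predicted = extract_number(model(prompt))
            results.append(
                predicted is not None and abs(predicted - problem.answer) <= tol
            )
        variant_hits += sum(results)
        variant_total += len(results)
        robust_hits += all(results)  # correct only if every wording is solved
    average = variant_hits / variant_total
    robust = robust_hits / len(problems)
    return {
        "average_accuracy": average,
        "robust_accuracy": robust,
        "fragility_gap": average - robust,
    }


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM (assumption): adds numbers only if written as digits."""
    numbers = [int(n) for n in re.findall(r"\d+", prompt)]
    if len(numbers) >= 2:
        return f"The answer is {sum(numbers)}"
    return "I am not sure."


problems = [
    Problem(variants=["What is 3 + 4?", "What is three plus four?"], answer=7),
    Problem(variants=["What is 10 + 5?", "Add 5 to 10."], answer=15),
]
report = evaluate(problems, toy_model)
print(report)
```

In this toy run the stand-in model fails the variant that spells the numbers out as words, so average accuracy is higher than robust accuracy. A real harness would replace `toy_model` with a call to the LLM under test and would likely need a more careful answer parser than "last number in the reply".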