Pavel Golikov, Evgenii Opryshko, Gennady Pekhimenko, Mark C. Jeffrey
Build a 'Robustness Perturbation Pipeline' that lets researchers automatically reformat and mutate logic and math benchmarks to test model resilience.
Suggested repo: robust-logic
"Test if your model actually understands math or if it's just overfit to prompt templates."
Estimated effort: 25h
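One possible shape for such a pipeline, as a minimal sketch only: each perturbation is a function that rewrites a benchmark item without changing its correct answer, and variants are generated from a seeded random generator so that runs are reproducible. Everything here (`Item`, `reformat_template`, `swap_names`, `make_variants`, the template strings, and the name pools) is hypothetical and chosen for illustration; none of it comes from the original work.

```python
import random
import re
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Item:
    """One benchmark example (hypothetical schema)."""
    question: str
    answer: str

# A perturbation returns a variant whose correct answer is meant to be unchanged.
Perturbation = Callable[[Item, random.Random], Item]

# Illustrative prompt templates; a real benchmark would supply its own.
TEMPLATES = [
    "Q: {q}\nA:",
    "Question: {q}\nAnswer:",
    "Solve the following problem.\n{q}\nFinal answer:",
]

# Illustrative name pools; they are disjoint so substitutions cannot chain.
KNOWN_NAMES = ["Alice", "Bob", "Carol"]
REPLACEMENTS = ["Priya", "Kenji", "Amara", "Tomas"]

def reformat_template(item: Item, rng: random.Random) -> Item:
    """Wrap the question in a randomly chosen prompt template."""
    return Item(rng.choice(TEMPLATES).format(q=item.question), item.answer)

def swap_names(item: Item, rng: random.Random) -> Item:
    """Replace known person names with fresh ones, consistently."""
    present = [n for n in KNOWN_NAMES if re.search(rf"\b{n}\b", item.question)]
    new_names = rng.sample(REPLACEMENTS, len(present))
    question, answer = item.question, item.answer
    for old, new in zip(present, new_names):
        question = re.sub(rf"\b{old}\b", new, question)
        answer = re.sub(rf"\b{old}\b", new, answer)
    return Item(question, answer)

def make_variants(item: Item, seed: int, n: int) -> List[Item]:
    """Generate n perturbed variants of item, reproducible for a given seed."""
    rng = random.Random(seed)
    pipeline: List[Perturbation] = [swap_names, reformat_template]
    variants = []
    for _ in range(n):
        variant = item
        for perturb in pipeline:
            variant = perturb(variant, rng)
        variants.append(variant)
    return variants

if __name__ == "__main__":
    original = Item(
        "Alice has 3 apples and Bob gives her 5 more. "
        "How many apples does Alice have?",
        "8",
    )
    for v in make_variants(original, seed=0, n=3):
        print(v.question, "->", v.answer)
        print("---")
```

Comparing a model's accuracy on the original items against its accuracy on the variants would then give a rough measure of how much it depends on surface form. Perturbations that change numbers or problem structure would also need to recompute the answer, which this sketch does not attempt.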