arXiv, 9h ago

Robust Reasoning Benchmark

Pavel Golikov, Evgenii Opryshko, Gennady Pekhimenko, Mark C. Jeffrey

Analysis

Viral velocity: low

Implementation gap: No

Novelty: 7/10

Category: tool

Topics: reasoning, benchmarking, robustness

Opportunity Brief

Build a 'Robustness Perturbation Pipeline' that lets researchers automatically reformat and mutate logic and math benchmarks, then measure how much model accuracy drops on the perturbed versions.
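A minimal sketch of what such a pipeline could look like, in Python. Everything here is an assumption made for illustration and is not taken from the paper: the `Item` record, the three perturbations (`shout`, `add_distractor`, `swap_template`), and the `robustness_gap` metric are hypothetical names, and the "model" is a toy function that has memorized one exact prompt.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass(frozen=True)
class Item:
    """One benchmark problem (hypothetical schema)."""
    question: str
    answer: str

# A perturbation rewrites the question text but must not change the answer.
Perturbation = Callable[[str, random.Random], str]

def shout(text: str, rng: random.Random) -> str:
    # Surface-level change: same content, different casing.
    return text.upper()

def add_distractor(text: str, rng: random.Random) -> str:
    # Append an irrelevant sentence that should not affect the solution.
    distractors = [
        "Note that the weather is sunny today.",
        "Ignore the page number printed below.",
    ]
    return f"{text} {rng.choice(distractors)}"

def swap_template(text: str, rng: random.Random) -> str:
    # Wrap the question in a different prompt template.
    templates = [
        "Question: {q}\nAnswer:",
        "Solve the following problem.\n{q}",
        "{q}\nGive only the final result.",
    ]
    return rng.choice(templates).format(q=text)

PERTURBATIONS: Dict[str, Perturbation] = {
    "shout": shout,
    "distractor": add_distractor,
    "template": swap_template,
}

def make_variants(item: Item, seed: int = 0) -> Dict[str, Item]:
    """Return one perturbed copy of `item` per perturbation, reproducibly."""
    variants = {}
    for name, fn in PERTURBATIONS.items():
        rng = random.Random(f"{seed}:{name}")  # independent stream per perturbation
        variants[name] = Item(fn(item.question, rng), item.answer)
    return variants

def accuracy(model: Callable[[str], str], items: List[Item]) -> float:
    return sum(model(it.question).strip() == it.answer for it in items) / len(items)

def robustness_gap(model: Callable[[str], str], items: List[Item], seed: int = 0) -> Dict[str, float]:
    """Accuracy on the original items minus accuracy on each perturbed set."""
    base = accuracy(model, items)
    gaps = {}
    for name in PERTURBATIONS:
        perturbed = [make_variants(it, seed)[name] for it in items]
        gaps[name] = base - accuracy(model, perturbed)
    return gaps

# Toy stand-in for a model that is overfit to one exact prompt string.
def brittle_model(prompt: str) -> str:
    return "4" if prompt == "What is 2 + 2?" else "?"

items = [Item("What is 2 + 2?", "4")]
gaps = robustness_gap(brittle_model, items)
```

In this sketch a gap near 0 would suggest the model is insensitive to a perturbation, while a large gap suggests it depends on the original surface form. A real pipeline would swap `brittle_model` for an actual model call and would need answer-preserving perturbations validated per benchmark.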

Suggested repo: robust-logic

"Test if your model actually understands math or if it's just overfit to prompt templates."

Estimated effort: 25h