Sinan G. Aksoy, Alexandra A. Sabrio, Erik VonKaenel, Lee Burke
Create an open-source library for automated robustness testing of LLM-as-a-judge systems using adversarial document perturbation. It would let RAG developers measure how fragile their evaluation pipeline is to minor text changes.
Suggested repo: judge-stress
"Is your LLM evaluator actually blind?"
Estimated effort: 40h
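
As a rough illustration of the idea described above, the sketch below applies small random perturbations to a retrieved document and counts how often a judge's pass/fail verdict flips. Everything here is an assumption made for illustration: the function names, the `(document, answer) -> score` judge signature, the 0.5 threshold, and the two perturbation types are hypothetical and not taken from any existing library or from the authors' work. The toy judge is a stand-in for a real LLM call.

```python
import random
from typing import Callable, Dict, List

# Hypothetical judge signature: (document, answer) -> score in [0, 1].
Judge = Callable[[str, str], float]
Perturbation = Callable[[str, random.Random], str]


def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Swap one random pair of adjacent characters (a simulated typo)."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def double_space(text: str, rng: random.Random) -> str:
    """Duplicate one randomly chosen space character."""
    positions = [i for i, ch in enumerate(text) if ch == " "]
    if not positions:
        return text
    i = rng.choice(positions)
    return text[:i] + " " + text[i:]


# Assumed set of "minor text changes"; a real library would offer more.
PERTURBATIONS: List[Perturbation] = [swap_adjacent_chars, double_space]


def fragility(judge: Judge, document: str, answer: str,
              n_trials: int = 20, threshold: float = 0.5,
              seed: int = 0) -> Dict[str, float]:
    """Estimate how often minor document edits change the judge's verdict."""
    rng = random.Random(seed)
    base_score = judge(document, answer)
    base_verdict = base_score >= threshold
    flips = 0
    max_shift = 0.0
    for _ in range(n_trials):
        perturb = rng.choice(PERTURBATIONS)
        score = judge(perturb(document, rng), answer)
        if (score >= threshold) != base_verdict:
            flips += 1
        max_shift = max(max_shift, abs(score - base_score))
    return {"flip_rate": flips / n_trials, "max_score_shift": max_shift}


def toy_overlap_judge(document: str, answer: str) -> float:
    """Stand-in for an LLM judge: fraction of answer words found in the document."""
    words = answer.lower().split()
    if not words:
        return 0.0
    doc_words = set(document.lower().split())
    return sum(w in doc_words for w in words) / len(words)


if __name__ == "__main__":
    doc = "The cat sat on the mat near the door"
    print(fragility(toy_overlap_judge, doc, "cat mat"))
```

In practice, `toy_overlap_judge` would be replaced by a function that calls the actual LLM evaluator. A flip rate well above zero on edits that leave the meaning intact would suggest the evaluation pipeline is sensitive to surface noise.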