arXiv · 9h ago

Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring

Sinan G. Aksoy, Alexandra A. Sabrio, Erik VonKaenel, Lee Burke

Analysis

Viral velocity: low

Implementation gap: YES

Novelty: 5/10

Category: tool

Topics

evaluation, rag, inference

Opportunity Brief

Create an open-source library for automated robustness testing of LLM-as-a-judge systems using adversarial document perturbation. Such a library would let RAG developers measure how sensitive their evaluation pipeline's scores are to minor text changes.
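A minimal sketch of what such a perturbation test could look like, under stated assumptions: the names `jaccard_judge`, `insert_needle`, `drop_sentence`, and `sensitivity_report` are hypothetical, and the token-overlap judge is only a stand-in so the example runs offline. In practice the judge would be a call to an LLM that returns a similarity score. This does not reproduce the paper's protocol; it only illustrates the idea of perturbing a document and measuring how far the judge's score moves.

```python
import random
from statistics import mean
from typing import Callable, Dict, List

# A judge maps (reference, candidate) to a similarity score in [0, 1].
Judge = Callable[[str, str], float]


def jaccard_judge(reference: str, candidate: str) -> float:
    """Stand-in judge (assumption): token-set Jaccard overlap, not an LLM."""
    a = set(reference.lower().split())
    b = set(candidate.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0


def _sentences(doc: str) -> List[str]:
    # Naive sentence split on periods; enough for a sketch.
    return [s.strip() for s in doc.split(".") if s.strip()]


def insert_needle(doc: str, needle: str, rng: random.Random) -> str:
    """Insert one extra sentence (the 'needle') at a random sentence boundary."""
    parts = _sentences(doc)
    parts.insert(rng.randint(0, len(parts)), needle.rstrip("."))
    return ". ".join(parts) + "."


def drop_sentence(doc: str, rng: random.Random) -> str:
    """Remove one randomly chosen sentence, keeping at least one."""
    parts = _sentences(doc)
    if len(parts) > 1:
        parts.pop(rng.randrange(len(parts)))
    return ". ".join(parts) + "."


def sensitivity_report(
    doc: str, judge: Judge, needle: str, trials: int = 20, seed: int = 0
) -> Dict[str, float]:
    """Return the baseline score and the mean score drop per perturbation type."""
    rng = random.Random(seed)
    baseline = judge(doc, doc)
    perturbations = {
        "insert_needle": lambda d: insert_needle(d, needle, rng),
        "drop_sentence": lambda d: drop_sentence(d, rng),
    }
    report = {"baseline": baseline}
    for name, perturb in perturbations.items():
        scores = [judge(doc, perturb(doc)) for _ in range(trials)]
        report[name] = baseline - mean(scores)
    return report


if __name__ == "__main__":
    doc = "The cat sat on the mat. Dogs bark at night. Rain fell all day."
    print(sensitivity_report(doc, jaccard_judge, "The treaty was never signed."))
```

A judge whose score barely moves when a contradictory sentence is inserted would be a sign of the fragility the brief describes. Swapping `jaccard_judge` for a function that prompts an LLM leaves the rest of the harness unchanged.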

Suggested repo: judge-stress

"Is your LLM evaluator actually blind?"

Estimated effort: 40h