Create a sensitivity-testing suite for 'LLM-as-a-Judge' evaluators. The tool should identify where an LLM evaluator fails by systematically inserting 'semantic needles' (subtle, meaning-changing perturbations such as negations) into the documents being compared, then checking whether the evaluator notices the change.
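
One possible shape for such a suite is sketched below, assuming the judge can be wrapped as a plain callable that takes a reference and a candidate text and returns True when it rates them as equivalent. Every name here (`NEGATION_RULES`, `Needle`, `negate`, `run_suite`, `overlap_judge`, `exact_judge`) is hypothetical and chosen for illustration. The two judges are offline stand-ins, not real LLM calls, and the regex rules cover only a handful of negation patterns.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical negation rules (pattern, replacement); illustrative, not exhaustive.
# They do not detect sentences that are already negated.
NEGATION_RULES = [
    (re.compile(r"\bis\b"), "is not"),
    (re.compile(r"\bare\b"), "are not"),
    (re.compile(r"\bcan\b"), "cannot"),
    (re.compile(r"\bwill\b"), "will not"),
]

# Assumed judge interface: (reference, candidate) -> True if rated equivalent.
Judge = Callable[[str, str], bool]


@dataclass
class Needle:
    sentence_index: int
    original: str
    perturbed: str


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def negate(sentence: str) -> Optional[str]:
    """Apply the first rule (in list order) that matches; None if none applies."""
    for pattern, replacement in NEGATION_RULES:
        if pattern.search(sentence):
            return pattern.sub(replacement, sentence, count=1)
    return None


def run_suite(document: str, judge: Judge) -> Tuple[int, List[Needle]]:
    """Insert one needle at a time; return (needles tried, needles the judge missed)."""
    sentences = split_sentences(document)
    reference = " ".join(sentences)
    tried, missed = 0, []
    for i, sentence in enumerate(sentences):
        perturbed = negate(sentence)
        if perturbed is None:
            continue  # no rule applies to this sentence
        tried += 1
        variant = " ".join(sentences[:i] + [perturbed] + sentences[i + 1:])
        # The meaning changed, so a verdict of "equivalent" counts as a miss.
        if judge(reference, variant):
            missed.append(Needle(i, sentence, perturbed))
    return tried, missed


def overlap_judge(reference: str, candidate: str, threshold: float = 0.8) -> bool:
    """Stand-in for an insensitive judge: word-set Jaccard overlap."""
    a = set(re.findall(r"\w+", reference.lower()))
    b = set(re.findall(r"\w+", candidate.lower()))
    if not (a | b):
        return True
    return len(a & b) / len(a | b) >= threshold


def exact_judge(reference: str, candidate: str) -> bool:
    """Stand-in for a maximally sensitive judge: exact string equality."""
    return reference == candidate
```

Inserting one needle per variant keeps each miss attributable to a single sentence, so the list returned by `run_suite` can be read directly as a map of where the judge is blind. Replacing the stand-in judges with a wrapper around a real model call is left open, since the brief does not specify a provider or a prompt format.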