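Build a robustness-testing tool for 'LLM-as-a-judge' scenarios. The tool injects controlled semantic needles (small, deliberate changes in meaning, for example an altered fact or a contradictory sentence) into documents and checks whether the judge's verdict changes in response. A judge that fails to react to a needle that should matter is not sensitive enough to be trusted. This is valuable for companies building RAG systems that need more reliable evaluation.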
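
One possible shape for the core loop is sketched below. It is a minimal sketch and not a full tool. It assumes the judge can be modeled as a function from document text to a numeric score, that a needle is a single sentence, and that "sensitivity" means the fraction of injections that move the score by at least a threshold. All names here (`Injection`, `split_sentences`, `inject_needle`, `sensitivity`, `toy_judge`) are hypothetical, and the keyword-matching `toy_judge` only stands in for a real LLM call.

```python
import random
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Injection:
    text: str      # document with the needle inserted
    position: int  # sentence index at which the needle was inserted


def split_sentences(document: str) -> List[str]:
    # Naive splitter (assumption): break on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]


def inject_needle(document: str, needle: str, rng: random.Random) -> Injection:
    # Insert the needle at a random sentence boundary, including start and end.
    sentences = split_sentences(document)
    position = rng.randint(0, len(sentences))  # randint is inclusive on both ends
    sentences.insert(position, needle)
    return Injection(" ".join(sentences), position)


def sensitivity(
    document: str,
    needle: str,
    judge: Callable[[str], float],
    trials: int = 20,
    threshold: float = 0.1,
    seed: int = 0,
) -> float:
    # Fraction of injections that shift the judge's score by at least `threshold`
    # relative to the score of the unmodified document.
    rng = random.Random(seed)  # seeded so that runs are reproducible
    baseline = judge(document)
    hits = 0
    for _ in range(trials):
        injected = inject_needle(document, needle, rng)
        if abs(judge(injected.text) - baseline) >= threshold:
            hits += 1
    return hits / trials


def toy_judge(text: str) -> float:
    # Stand-in for an LLM judge (assumption): penalizes one known-wrong fact.
    return 0.0 if "90 days" in text else 1.0


if __name__ == "__main__":
    doc = "Refunds are accepted within 30 days. Shipping is free. Support is open daily."
    needle = "Refunds are accepted within 90 days."
    print(sensitivity(doc, needle, toy_judge))
```

With `toy_judge`, every injection flips the score, so the reported sensitivity is 1.0. A judge that returns the same score for every input would report 0.0. Randomizing the insertion position is one design choice among several. A fuller tool might sweep positions systematically to detect position-dependent blind spots.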