Develop an automated evaluation harness that stress-tests AI agents against adversarial benchmark inputs to identify common failure modes. This tool would allow researchers to standardize robustness testing across different LLM backends.