Develop a framework that performs iterative red-teaming to patch both the policy model and the reward model simultaneously. This tool would automate the detection of 'synergistic' failure modes where the RM is blind to model exploits.