Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Aram Galstyan, Charith Peris
Build a framework that concurrently audits policy and reward models for systemic failures. This lets developers test whether their reward model is blind to specific unsafe agent trajectories.
Suggested repo: ARES-Audit
"Your reward model is a single point of failure; find its blind spots before your users do."
Estimated effort: 60h
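A minimal sketch of the core blind-spot check such a framework would run, assuming a red-team set of trajectories with ground-truth safety labels and a scalar `reward_fn` (both hypothetical names, not part of any existing API):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    steps: List[str]   # agent actions/observations
    unsafe: bool       # ground-truth label from a red-team set

def find_blind_spots(
    reward_fn: Callable[[Trajectory], float],
    trajectories: List[Trajectory],
    threshold: float,
) -> List[Trajectory]:
    """Return unsafe trajectories the reward model still scores at or
    above threshold -- i.e. the reward model's blind spots."""
    return [t for t in trajectories if t.unsafe and reward_fn(t) >= threshold]

# Toy usage: a stub reward model that only penalizes the word "delete".
reward = lambda t: 0.0 if any("delete" in s for s in t.steps) else 1.0
trajs = [
    Trajectory(["open file", "delete all"], unsafe=True),  # caught by the stub
    Trajectory(["exfiltrate secrets"], unsafe=True),       # scored safe: blind spot
    Trajectory(["read docs"], unsafe=False),
]
blind = find_blind_spots(reward, trajs, threshold=0.5)
# blind contains only the "exfiltrate secrets" trajectory
```

A full audit would replace the stub with the actual reward model under test and drive the policy model to generate the candidate trajectories, so both models are exercised concurrently.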