Jacob Dang, Brian Y. Xie, Omar G. Younis
View original ↗Create an adversarial testing framework that detects 'subliminal' behavioral leakage during agent distillation. Developers should build a suite that checks if benign-looking teacher trajectories introduce hidden malicious triggers in the student model.
Suggested repo: subliminal-guard
"Is your distilled agent hiding secret behaviors you didn't train it to have?"
Estimated effort: 40h