Chia-Hsuan Lee, Mingyang Zhou, Renkun Ni, Zelei Cheng, Sihui Dai, Supriyo Chakraborty, Shixiong Zhang, Sambit Sahu, William Campbell
Develop a diagnostic toolkit that visualizes how specific preference pairs shift a model's latent activations during RLHF. Developers could then debug reasoning failures by identifying which data properties trigger capability gains or regressions.
Suggested repo: PreferenceScanner
"See exactly why your model learned that answer."
Estimated effort: 40h
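
A minimal sketch of the core measurement such a toolkit might start from, assuming hidden-state vectors for each preference pair have already been extracted at two checkpoints (before and after an RLHF update). The function names, the dictionary keys, and the choice of L2 distance as the shift metric are all illustrative assumptions, not part of the original proposal.

```python
import math


def activation_shift(before, after):
    """Return the L2 norm of the change between two activation vectors."""
    if len(before) != len(after):
        raise ValueError("activation vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))


def rank_pairs_by_shift(pairs):
    """Score each preference pair by how far its activations moved.

    Each item in `pairs` is a dict with hypothetical keys:
    id, chosen_before, chosen_after, rejected_before, rejected_after.
    Returns a list of score dicts sorted by total shift, largest first.
    """
    scored = []
    for pair in pairs:
        chosen = activation_shift(pair["chosen_before"], pair["chosen_after"])
        rejected = activation_shift(pair["rejected_before"], pair["rejected_after"])
        scored.append(
            {
                "id": pair["id"],
                "chosen_shift": chosen,
                "rejected_shift": rejected,
                "total_shift": chosen + rejected,
            }
        )
    return sorted(scored, key=lambda row: row["total_shift"], reverse=True)


# Toy data standing in for real extracted hidden states (illustrative only).
example_pairs = [
    {
        "id": "pair-0",
        "chosen_before": [0.0, 0.0],
        "chosen_after": [3.0, 4.0],
        "rejected_before": [1.0, 1.0],
        "rejected_after": [1.0, 1.0],
    },
    {
        "id": "pair-1",
        "chosen_before": [1.0, 2.0],
        "chosen_after": [1.0, 2.0],
        "rejected_before": [0.0, 0.0],
        "rejected_after": [0.0, 1.0],
    },
]

ranked = rank_pairs_by_shift(example_pairs)
for row in ranked:
    print(row["id"], round(row["total_shift"], 3))
```

Pairs at the top of the ranking are the ones whose representations moved most across the update, which makes them natural first candidates for inspection. A real tool would likely replace the toy vectors with activations captured from specific layers and might prefer a direction-aware metric such as cosine distance.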