llmmadness
Develop an evaluation suite that tests for 'hidden' model constraints that survive fine-tuning. The tool would help researchers identify alignment artifacts in supposedly uncensored models.
Suggested repo: uncensor-probe
"Find out what your 'uncensored' model is still hiding from you."
Estimated effort: 40h
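
A rough sketch of what the core loop of such a suite might look like, assuming the model is exposed as a plain prompt-to-text callable. Everything here is hypothetical and not part of the original idea: the names `REFUSAL_MARKERS`, `looks_like_refusal`, `refusal_rates`, and `stub_model`, the marker phrases, and the probe categories are illustrative only. Substring matching is a crude stand-in for a real refusal classifier.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical marker phrases (an assumption, not a validated list).
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm not able to", "as an ai")


def looks_like_refusal(response: str) -> bool:
    """Crude heuristic: flag a response containing any known refusal phrase."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rates(
    generate: Callable[[str], str],
    probes: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Run each (category, prompt) probe and return the refusal rate per category."""
    totals: Dict[str, int] = defaultdict(int)
    refusals: Dict[str, int] = defaultdict(int)
    for category, prompt in probes:
        totals[category] += 1
        if looks_like_refusal(generate(prompt)):
            refusals[category] += 1
    return {category: refusals[category] / totals[category] for category in totals}


def stub_model(prompt: str) -> str:
    """Stand-in for a real model: refuses anything mentioning 'medical'."""
    if "medical" in prompt:
        return "I cannot help with that."
    return "Sure, here is an answer."


# Placeholder probes; a real suite would load a curated, categorized prompt set.
probes = [
    ("medical", "medical question placeholder A"),
    ("medical", "medical question placeholder B"),
    ("cooking", "cooking question placeholder"),
]

rates = refusal_rates(stub_model, probes)
print(rates)
```

Reporting rates per category rather than one global number is a design choice: a constraint that survives fine-tuning is more likely to show up as an unusually high refusal rate in one topic area than as a uniform shift.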