arXiv · 9h ago

Finding and Reactivating Post-Trained LLMs' Hidden Safety Mechanisms

Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, Yisen Wang

Analysis

Viral velocity: low

Implementation gap: YES

Novelty: 7/10

Category: tool

Topics: fine-tuning, safety

Develop a tool that audits and recovers safety alignment mechanisms, which are often 'bleached' during heavy chain-of-thought (CoT) fine-tuning.

Suggested repo: safe-react

"Find and rescue the safety mechanisms lost in your fine-tuning run."

Estimated effort: 50h