Cameron Pattison, Lorenzo Manuali, Seth Lazar
Develop a framework for model alignment that lets users define custom 'moral reasoning' schemas, enabling local models to decline illegitimate rules without breaking safety protocols.
Suggested repo: defiant-ai
"Teaching AI to discern unjust rules."
Estimated effort: 100h
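As a minimal illustration of what such a user-defined schema might look like: a structure that holds user principles alongside non-overridable safety invariants, and classifies incoming rules against both. Every name and the evaluation logic here are hypothetical sketches, not part of the proposal itself.

```python
from dataclasses import dataclass, field

@dataclass
class MoralSchema:
    # Hypothetical user-defined principles (not from the proposal).
    principles: list = field(default_factory=list)
    # Safety rules the model must always follow, regardless of the schema.
    safety_invariants: set = field(default_factory=set)

    def evaluate_rule(self, rule_id: str, violates: set) -> str:
        """Classify a rule as 'follow' or 'decline' (sketch logic)."""
        if rule_id in self.safety_invariants:
            return "follow"   # safety protocols are never overridden
        if violates & set(self.principles):
            return "decline"  # rule conflicts with a user principle
        return "follow"

schema = MoralSchema(
    principles=["informed_consent", "non_discrimination"],
    safety_invariants={"no_harmful_content"},
)
print(schema.evaluate_rule("no_harmful_content", {"informed_consent"}))  # follow
print(schema.evaluate_rule("mandatory_tracking", {"informed_consent"}))  # decline
```

The key design point this sketch tries to capture is the asymmetry in the proposal: user schemas can override ordinary rules they judge illegitimate, but safety invariants sit outside the schema's authority.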