arXiv9h ago

Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation

Jacob Dang, Brian Y. Xie, Omar G. Younis

View original ↗

Analysis

Viral velocity

low

Implementation gapYES

Novelty8/10

Categorypaper

Topics

agentsdistillationsafetyalignment

Opportunity Brief

Create an adversarial testing framework that detects 'subliminal' behavioral leakage during agent distillation. Developers should build a suite that checks if benign-looking teacher trajectories introduce hidden malicious triggers in the student model.

Suggested repo: subliminal-guard

"Is your distilled agent hiding secret behaviors you didn't train it to have?"

Estimated effort: 40h