Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu, Narutatsu Ri, Liam Fowl, Abhishek Panigrahi, Danqi Chen, Sanjeev Arora
Build a lightweight training harness that converts binary reward signals into dense, token-level supervision via self-distillation, letting models learn effectively in sparse-reward environments without external teacher labels.
Suggested repo: self-distill
"Turn sparse RL rewards into dense token-level training signals automatically."
Estimated effort: 40h
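The pitch above leaves the mechanism open; one plausible reading is to freeze a snapshot of the model as its own teacher, keep only reward-1 rollouts, and use the snapshot's full softmax distribution as dense per-token soft labels. A minimal NumPy sketch under those assumptions (the function name `self_distill_targets` and all shapes are hypothetical, not from the source):

```python
import numpy as np

def self_distill_targets(teacher_logits, rewards):
    """Turn binary sequence-level rewards into dense per-token supervision.

    teacher_logits: (B, T, V) logits from a frozen snapshot of the model itself
    rewards: (B,) binary sequence-level rewards in {0, 1}

    Returns:
      targets: (B, T, V) soft labels (the snapshot's own token distributions)
      weights: (B, T) per-token loss weights; failed rollouts get weight 0
    """
    # Numerically stable softmax over the vocab axis -> dense soft labels.
    z = teacher_logits - teacher_logits.max(axis=-1, keepdims=True)
    targets = np.exp(z)
    targets /= targets.sum(axis=-1, keepdims=True)

    # Broadcast the binary sequence reward to every token position, so a
    # student cross-entropy loss against `targets` is masked to successes.
    _, T, _ = teacher_logits.shape
    weights = np.repeat(rewards[:, None].astype(float), T, axis=1)
    return targets, weights

# Hypothetical usage: 2 rollouts, 3 tokens each, vocab size 5.
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(2, 3, 5))
rewards = np.array([1, 0])  # rollout 0 earned reward, rollout 1 did not
targets, weights = self_distill_targets(teacher_logits, rewards)
```

The student would then minimize a token-level cross-entropy against `targets`, scaled by `weights`, giving a gradient at every token of a successful rollout instead of a single scalar reward at the end.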