Jinghan Zhang, Fengran Mo, Tharindu Cyril Weerasooriya, Ruimin Dai, Xiaoyan Han, Yanjie Fu, Dakuo Wang, Kunpeng Liu
Build an RL policy optimizer that rewards internal structural consistency rather than just output correctness. Develop a custom loss function that penalizes illogical reasoning paths.
Suggested repo: logic-rl
"Train LLMs that think, don't just guess."
Estimated effort: 90h
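
One possible starting point for the custom loss is a REINFORCE-style objective whose reward combines answer correctness with a penalty for inconsistent reasoning steps. The sketch below is an illustration under stated assumptions, not the method from the original work: the `Trajectory` fields, the per-step consistency score in [0, 1], the `consistency_weight` parameter, and the mean-reward baseline are all hypothetical choices. It uses plain Python floats; in a real trainer the log-probabilities would be tensors from an autograd framework.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    """One sampled reasoning path (hypothetical structure for illustration)."""
    step_logprobs: List[float]     # log-probability of each reasoning step under the policy
    step_consistency: List[float]  # assumed per-step consistency score in [0, 1]
    correct: bool                  # whether the final answer matched the reference


def shaped_reward(traj: Trajectory, consistency_weight: float = 0.5) -> float:
    """Outcome reward minus a penalty proportional to mean step inconsistency."""
    outcome = 1.0 if traj.correct else 0.0
    if traj.step_consistency:
        mean_consistency = sum(traj.step_consistency) / len(traj.step_consistency)
    else:
        # Assumption: a path with no scored steps is treated as fully inconsistent.
        mean_consistency = 0.0
    return outcome - consistency_weight * (1.0 - mean_consistency)


def policy_gradient_loss(batch: List[Trajectory], consistency_weight: float = 0.5) -> float:
    """REINFORCE surrogate loss with a mean-reward baseline over the batch."""
    if not batch:
        raise ValueError("batch must contain at least one trajectory")
    rewards = [shaped_reward(t, consistency_weight) for t in batch]
    baseline = sum(rewards) / len(rewards)
    # Minimizing this loss raises the log-probability of above-baseline paths.
    terms = [(r - baseline) * sum(t.step_logprobs) for r, t in zip(rewards, batch)]
    return -sum(terms) / len(terms)


if __name__ == "__main__":
    consistent = Trajectory([-0.5, -0.5], [1.0, 1.0], correct=True)
    inconsistent = Trajectory([-1.0, -1.0], [0.0, 0.0], correct=True)
    print(shaped_reward(consistent), shaped_reward(inconsistent))
    print(policy_gradient_loss([consistent, inconsistent]))
```

With this shaping, two paths that reach the same correct answer receive different rewards when one of them contains inconsistent steps, which is the behavior the project description asks for. How the consistency score itself is computed is left open here.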