Fei Ding, Yongkang Zhang, Youwei Wang, Zijian Zeng
Create a fine-tuning library that explicitly handles "token gradient cancellation" to prevent entropy collapse during RL-based reasoning-model training. This is a critical utility for teams training large reasoning models on long-horizon tasks.
Suggested repo: reason-tune
"Stop your RL training from collapsing at the finish line."
Estimated effort: 40h
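The failure mode this library would target can be sketched minimally. The sketch below is an illustration, not the library's actual API: all names (`token_grad_contributions`, `cancellation_ratio`) and numbers are hypothetical. It assumes GRPO-style group-normalized, mean-zero advantages, under which per-token policy-gradient contributions from positive and negative rollouts on a shared prefix token can sum to zero, leaving that token with no learning signal even though individual contributions are large.

```python
import numpy as np

def token_grad_contributions(logp_grads, advantages):
    """Per-rollout policy-gradient contribution for one shared token.

    Row i is g_i = A_i * grad_theta log pi(token); hypothetical helper.
    """
    return advantages[:, None] * logp_grads

def cancellation_ratio(contribs):
    """1.0 = contributions fully cancel (zero net signal), 0.0 = fully aligned."""
    net = np.abs(contribs.sum(axis=0))
    total = np.abs(contribs).sum(axis=0) + 1e-12
    return float(np.mean(1.0 - net / total))

# Hypothetical group of 4 rollouts that share a prefix token, so the
# log-prob gradient for that token is identical across rollouts.
logp_grads = np.tile(np.array([0.5, -0.2, 0.1]), (4, 1))
# Group-normalized (mean-zero) advantages, GRPO-style.
advantages = np.array([1.0, 1.0, -1.0, -1.0])

contribs = token_grad_contributions(logp_grads, advantages)
net = contribs.sum(axis=0)          # net update for this token's parameters
print(np.allclose(net, 0.0))        # → True: contributions cancel exactly
print(cancellation_ratio(contribs)) # → 1.0: no learning signal survives
```

A library along these lines might track a per-token cancellation metric like this during training and reweight or mask fully cancelled tokens, alongside monitoring policy entropy; the specific mitigation strategy is left to the implementation.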