Wangjie Gan, Miao Pan, Linbo Xi, Wenqi Zhang, Jintao Chen, Jianwei Yin, Xuhong Zhang
Create a library that integrates group-based advantage functions into standard RLHF pipelines. This allows for more stable policy updates by rectifying training coefficients based on group performance metrics.
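As a rough illustration of what a group-based advantage function could look like, the sketch below normalizes each sample's reward against the mean and standard deviation of its own group (for example, all completions sampled for one prompt). This is an assumption in the style of group-relative normalization, not the method the description specifies; the name `group_advantages` and its parameters are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def group_advantages(
    rewards: Sequence[float],
    group_ids: Sequence[int],
    eps: float = 1e-8,
) -> List[float]:
    """Return per-sample advantages normalized within each group.

    Hypothetical helper: assumes a group-relative normalization,
    which the project description does not spell out.
    """
    if len(rewards) != len(group_ids):
        raise ValueError("rewards and group_ids must have the same length")

    # Collect rewards by group (e.g. all completions for one prompt).
    grouped: Dict[int, List[float]] = {}
    for r, g in zip(rewards, group_ids):
        grouped.setdefault(g, []).append(r)

    # Per-group mean and population standard deviation.
    stats = {g: (mean(rs), pstdev(rs)) for g, rs in grouped.items()}

    # Advantage = (reward - group mean) / (group std + eps).
    # eps keeps the division finite when every reward in a group is equal.
    return [
        (r - stats[g][0]) / (stats[g][1] + eps)
        for r, g in zip(rewards, group_ids)
    ]


if __name__ == "__main__":
    # Two groups of two samples each.
    print(group_advantages([1.0, 0.0, 1.0, 1.0], [0, 0, 1, 1]))
```

In a trainer, these values would typically replace or rescale the per-sample advantage before the policy-gradient loss is computed; how the coefficients are actually rectified is left open by the description above.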
Suggested repo: gft-trainer
"Stop training on noisy averages and start optimizing groups."
Estimated effort: 40h