Haoran Xu, Kaiwen Hu, Somayeh Sojoudi, Amy Zhang
Provide a clean reference implementation of Value Gradient Flow for RL finetuning of LLMs. This is a critical building block for developers looking to move beyond standard DPO for behavior regularization.
Suggested repo: vgrad
"The end of brittle RLHF: learn stable policies via value gradient flow."
Estimated effort: 50h