Mengzhao Jia, Zhihan Zhang, Meng Jiang
Develop an RL library for multimodal reasoning that uses 'trajectory supervision', rewarding the intermediate reasoning steps rather than only the final answer. This helps developers build models that actually reason through visual tasks.
Suggested repo: vision-chain-rl
"Reward the thought process, not just the final guess."
Estimated effort: 120h
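
A minimal sketch of what trajectory supervision could look like as a reward function, in contrast to a final-answer-only reward. The card does not say how step quality is scored or combined, so everything here is an assumption for illustration: the `Step` structure, the per-step `score` in [0, 1] (for example from a judge model or annotation), the `step_weight` hyperparameter, and the linear blend.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One intermediate reasoning step (hypothetical structure)."""
    text: str
    score: float  # assumed step-level quality in [0, 1]


def outcome_reward(predicted: str, target: str) -> float:
    """Final-answer-only reward: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if predicted.strip().lower() == target.strip().lower() else 0.0


def trajectory_reward(steps: List[Step], predicted: str, target: str,
                      step_weight: float = 0.5) -> float:
    """Blend the mean step score with the final-answer reward.

    step_weight is an assumed hyperparameter; 0.0 recovers the
    outcome-only reward, 1.0 ignores the final answer entirely.
    """
    if not 0.0 <= step_weight <= 1.0:
        raise ValueError("step_weight must be in [0, 1]")
    # An empty trajectory earns no process credit.
    process = sum(s.score for s in steps) / len(steps) if steps else 0.0
    return step_weight * process + (1.0 - step_weight) * outcome_reward(predicted, target)


# Two trajectories that reach the same correct answer.
grounded = [Step("locate the red cube", 1.0), Step("count cubes to its left", 1.0)]
lucky = [Step("unsupported guess", 0.0), Step("unsupported guess", 0.0)]

grounded_reward = trajectory_reward(grounded, "3", "3")
lucky_reward = trajectory_reward(lucky, "3", "3")
```

Under these assumptions the grounded trajectory receives a higher reward than the lucky guess even though both answers are correct, which an outcome-only reward cannot distinguish.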