Zhuo Chen, Yuxuan Miao, Supryadi, Deyi Xiong
Create an automated dashboard for LLM pretraining data composition. Developers need a way to track and adjust domain-level sampling weights efficiently during the training process.
Suggested repo: mix-opt
"Don't just collect data, optimize your pretraining mix."
Estimated effort: 20h