arXiv, 10h ago

Data Mixing for Large Language Models Pretraining: A Survey and Outlook

Zhuo Chen, Yuxuan Miao, Supryadi, Deyi Xiong

Analysis

Viral velocity: low

Implementation gap: YES

Novelty: 5/10

Category: paper

Topics

training, llm, data-curation

Opportunity Brief

Create an automated dashboard for LLM pretraining data composition. Developers need a way to track and adjust domain-level sampling weights efficiently during the training process.
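
As a rough illustration of what "track and adjust domain-level sampling weights" could mean in code, here is a minimal Python sketch. It is not taken from the survey: the class name MixTracker, its methods, and the domain names ("web", "code", "math") are all hypothetical, and a real dashboard would sit on top of an actual data loader rather than a toy sampler.

```python
import random
from collections import Counter


def normalize(weights):
    """Scale non-negative domain weights so they sum to 1."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("domain weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one domain weight must be positive")
    return {domain: w / total for domain, w in weights.items()}


class MixTracker:
    """Hypothetical helper: holds target weights and counts sampled domains."""

    def __init__(self, weights, seed=0):
        self.weights = normalize(weights)
        self.counts = Counter()
        self._rng = random.Random(seed)

    def set_weight(self, domain, weight):
        """Change one domain's weight, then renormalize the whole mix."""
        raw = dict(self.weights)
        raw[domain] = weight
        self.weights = normalize(raw)

    def sample_domain(self):
        """Draw the domain for the next example according to current weights."""
        domains = list(self.weights)
        probs = [self.weights[d] for d in domains]
        domain = self._rng.choices(domains, weights=probs, k=1)[0]
        self.counts[domain] += 1
        return domain

    def observed_mix(self):
        """Fraction of draws per domain so far (empty dict before any draw)."""
        total = sum(self.counts.values())
        if total == 0:
            return {}
        return {d: self.counts[d] / total for d in self.weights}


if __name__ == "__main__":
    tracker = MixTracker({"web": 6, "code": 3, "math": 1})
    for _ in range(1000):
        tracker.sample_domain()
    print("target:", tracker.weights)
    print("observed:", tracker.observed_mix())
    tracker.set_weight("code", 0.6)  # upweight code mid-run
    print("adjusted target:", tracker.weights)
```

Comparing the target weights with the observed mix is the kind of signal a dashboard could plot over training steps; renormalizing after every adjustment keeps the weights a valid probability distribution.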

Suggested repo: mix-opt

"Don't just collect data, optimize your pretraining mix."

Estimated effort: 20h