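Build a modular tool for designing multimodal data mixtures that supports 'benchmark-targeted' training recipes, that is, mixtures tuned toward the benchmarks a model is meant to do well on. The aim is to replace the black art of balancing text, image, and video data for LLM training with an explicit, repeatable process.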
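One way such a tool could be organized is sketched below, as a minimal illustration and not a definitive design. It assumes a recipe is simply a set of sampling weights over text, image, and video, that each target benchmark has its own recipe, and that a final mixture is a priority-weighted blend of those recipes. The names `Recipe`, `blend`, `allocate`, the benchmark labels, and all numbers are hypothetical and made up for the example.

```python
from dataclasses import dataclass

MODALITIES = ("text", "image", "video")


@dataclass(frozen=True)
class Recipe:
    """Hypothetical container: sampling weights per modality, summing to 1."""
    weights: dict

    @staticmethod
    def from_raw(raw):
        # Normalize arbitrary non-negative scores into sampling proportions.
        if any(raw.get(m, 0.0) < 0 for m in MODALITIES):
            raise ValueError("weights must be non-negative")
        total = sum(raw.get(m, 0.0) for m in MODALITIES)
        if total <= 0:
            raise ValueError("at least one modality needs a positive weight")
        return Recipe({m: raw.get(m, 0.0) / total for m in MODALITIES})


def blend(recipes, priorities):
    """Combine per-benchmark recipes into one mixture, weighted by priority."""
    mixed = {m: 0.0 for m in MODALITIES}
    for name, priority in priorities.items():
        for m in MODALITIES:
            mixed[m] += priority * recipes[name].weights[m]
    return Recipe.from_raw(mixed)


def allocate(recipe, budget):
    """Split an integer sample budget across modalities (largest remainder)."""
    exact = {m: recipe.weights[m] * budget for m in MODALITIES}
    counts = {m: int(exact[m]) for m in MODALITIES}
    leftover = budget - sum(counts.values())
    # Hand out the samples lost to rounding, largest fractional part first.
    by_remainder = sorted(
        MODALITIES, key=lambda m: exact[m] - counts[m], reverse=True
    )
    for m in by_remainder[:leftover]:
        counts[m] += 1
    return counts


# Made-up illustrative numbers, not measured recommendations.
BENCHMARK_RECIPES = {
    "text_reasoning": Recipe.from_raw({"text": 8, "image": 1, "video": 1}),
    "video_qa": Recipe.from_raw({"text": 2, "image": 2, "video": 6}),
}

mixture = blend(BENCHMARK_RECIPES, {"text_reasoning": 0.5, "video_qa": 0.5})
counts = allocate(mixture, 1000)
print(mixture.weights)
print(counts)
```

Keeping normalization, blending, and budget allocation as separate functions is what makes the sketch modular: a different blending rule or a per-benchmark recipe fitted from ablation results could be swapped in without touching the other pieces.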