Build a unified masked-diffusion architecture that handles text, audio, and visual modalities without separate decoders.
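
One way to read "without separate decoders" is that every modality is tokenized into a single shared vocabulary with one shared mask token, so one network predicts masked positions regardless of modality. The sketch below illustrates only that bookkeeping and is not a definitive design. It assumes discrete tokenizers exist for each modality (none is specified here); the codebook sizes, the names `to_shared`, `mask_tokens`, and `unmask`, and the `predict` callable standing in for a trained network are all hypothetical.

```python
import random

# Hypothetical per-modality codebook sizes (assumption, not from the source).
VOCAB_SIZES = {"text": 1000, "audio": 512, "image": 2048}

# Give each modality a contiguous ID range inside one shared vocabulary.
OFFSETS = {}
_next_id = 0
for _name, _size in VOCAB_SIZES.items():
    OFFSETS[_name] = _next_id
    _next_id += _size
MASK_ID = _next_id           # one mask token shared by all modalities
TOTAL_VOCAB = _next_id + 1   # size of the single output head


def to_shared(modality, local_ids):
    """Map modality-local token IDs into the shared vocabulary."""
    offset, size = OFFSETS[modality], VOCAB_SIZES[modality]
    for i in local_ids:
        if not 0 <= i < size:
            raise ValueError(f"{modality} token {i} out of range")
    return [offset + i for i in local_ids]


def modality_of(shared_id):
    """Return the modality owning a shared ID, or None for the mask token."""
    for name, offset in OFFSETS.items():
        if offset <= shared_id < offset + VOCAB_SIZES[name]:
            return name
    return None


def mask_tokens(tokens, t, rng):
    """Forward corruption: replace each token with MASK_ID with probability t."""
    return [MASK_ID if rng.random() < t else tok for tok in tokens]


def unmask(tokens, predict, steps):
    """Reverse process: fill masked positions over a fixed number of steps.

    `predict(tokens, i)` is a stand-in for the shared network; it must
    return a shared-vocabulary ID (not MASK_ID) for position i.
    """
    tokens = list(tokens)
    for step in range(steps):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK_ID]
        if not masked:
            break
        # Reveal an equal share of the remaining masks (ceiling division),
        # so the final step always clears whatever is left.
        k = -(-len(masked) // (steps - step))
        for i in masked[:k]:
            tokens[i] = predict(tokens, i)
    return tokens
```

A mixed sequence is then just a concatenation such as `to_shared("text", [1, 2]) + to_shared("audio", [0, 5])`, and the same `mask_tokens` and `unmask` calls apply to it unchanged. A real system would replace the left-to-right reveal order with a confidence-based schedule and `predict` with a trained model; both are left open here.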