AudioX: A Unified Diffusion Transformer for Anything-to-Audio and Music Generation

2025-04-14

Existing audio and music generation models have three main limitations: they operate in isolation across modalities, high-quality multimodal training data is scarce, and they struggle to integrate diverse inputs. AudioX, a unified Diffusion Transformer model, addresses these challenges. It generates high-quality general audio and music, supports flexible natural language control, and processes text, video, image, music, and audio inputs. Its key innovation is a multimodal masked training strategy that enhances cross-modal representation learning. To overcome data scarcity, the authors curated two datasets: vggsound-caps (190K audio captions) and V2M-caps (6 million music captions). Extensive experiments show that AudioX matches or surpasses state-of-the-art specialized models while handling diverse input modalities within a unified architecture.
