Section 01
[Introduction] Audio-Omni: The First All-Round Framework Unifying Audio Understanding, Generation, and Editing
Audio-Omni is presented as the first end-to-end framework to unify generation and editing across the general sound, music, and speech domains while also integrating multimodal understanding. Its core architecture pairs a frozen multimodal large language model, which handles high-level semantic reasoning, with a trainable diffusion Transformer, which handles high-fidelity synthesis. The framework achieves state-of-the-art performance on multiple benchmarks and marks a step for audio AI toward general-purpose generative intelligence.
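
The split between a frozen reasoning component and a trainable synthesis component can be illustrated with a toy sketch. This is not Audio-Omni's implementation: `FrozenReasoner`, `TrainableSynthesizer`, and `train_step` are hypothetical names, and a linear predictor with a squared-error loss stands in for the diffusion Transformer and its training objective. The only point shown is that gradients update the synthesizer while the reasoner's parameters stay fixed.

```python
import random
from dataclasses import dataclass


@dataclass
class Param:
    """A single scalar parameter with a flag saying whether training may change it."""
    value: float
    trainable: bool = True


class FrozenReasoner:
    """Hypothetical stand-in for the frozen multimodal LLM.

    It maps an instruction to a conditioning vector; its parameters are
    marked non-trainable and are never updated.
    """

    def __init__(self, dim=4, seed=0):
        rng = random.Random(seed)
        self.params = [Param(rng.uniform(-1, 1), trainable=False) for _ in range(dim)]

    def encode(self, instruction):
        # Toy "semantics": scale the fixed weights by a value derived from the text.
        scale = 1 + (len(instruction) % 3)
        return [p.value * scale for p in self.params]


class TrainableSynthesizer:
    """Hypothetical stand-in for the trainable diffusion Transformer (here: a linear map)."""

    def __init__(self, dim=4):
        self.params = [Param(0.0) for _ in range(dim)]

    def predict(self, cond):
        return sum(p.value * c for p, c in zip(self.params, cond))


def train_step(reasoner, synth, instruction, target, lr=0.01):
    """One gradient step on 0.5 * (prediction - target)^2; returns the pre-update loss."""
    cond = reasoner.encode(instruction)      # frozen path: read-only
    error = synth.predict(cond) - target
    for p, c in zip(synth.params, cond):     # trainable path: receives the update
        if p.trainable:
            p.value -= lr * error * c        # d(loss)/d(p) = error * c
    return 0.5 * error ** 2
```

Calling `train_step` repeatedly with the same instruction and target lowers the loss while leaving every `FrozenReasoner` parameter unchanged, which mirrors, in miniature, the division of labor described above.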