Section 01
Introduction to Audio Reasoning in Multimodal Foundation Models: The First Systematic Review
This paper is the first systematic review of audio reasoning. It analyzes in depth the unique challenges, technical architectures, and training methods of the audio modality, and surveys four frontier directions: Audio-to-Text, Audio-to-Speech, Audio-Visual Reasoning, and Agentic Audio Reasoning. It also proposes a unified theoretical framework, lays out a technical roadmap for building robust and efficient native audio reasoning systems, and argues that audio reasoning should be developed as an independent research field.