Section 01
Multimodal Image-Audio Classification: Scene Understanding by Fusing Visual and Auditory Information
This project explores multimodal classification methods that fuse images and audio, aiming to recognize scenes more accurately by analyzing visual and auditory information together. A single modality often carries incomplete information about a scene, so the project focuses on three key techniques: feature extraction, modality fusion, and joint training. The goal is a model that integrates visual and auditory features closely enough to outperform single-modality methods at scene recognition.
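To make the idea of fusion concrete, here is a minimal sketch of feature-level fusion, assuming the simplest possible strategy: concatenate an image feature vector and an audio feature vector, then pass the result through one linear layer and a softmax. The project's actual extractors and fusion method are not specified here, so the function names (`fuse_features`, `classify`), the feature sizes, and the random weights are all hypothetical placeholders. Joint training is not shown.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(image_features, audio_features):
    """Feature-level fusion by concatenation (an assumed strategy)."""
    return list(image_features) + list(audio_features)


def classify(fused, weights, biases):
    """Apply one linear layer followed by softmax to the fused vector."""
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Hypothetical sizes: 4 image features, 3 audio features, 2 scene classes.
# Random values stand in for the outputs of real feature extractors.
rng = random.Random(0)
image_features = [rng.random() for _ in range(4)]
audio_features = [rng.random() for _ in range(3)]

fused = fuse_features(image_features, audio_features)

# Untrained placeholder weights: one row per scene class.
weights = [[rng.uniform(-1, 1) for _ in fused] for _ in range(2)]
biases = [0.0, 0.0]

probs = classify(fused, weights, biases)
print(probs)  # one probability per scene class
```

Concatenation is only one option. Because the classifier sees both modalities in a single vector, training it end to end would let the weights learn which visual and auditory cues matter together, which is the intuition behind joint training.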