[Introduction] GaMMA: A Large Multimodal Model for Joint Global-Temporal Music Understanding
GaMMA is a large multimodal model for music content understanding that uses a Mixture-of-Experts (MoE) audio encoder to unify temporal and non-temporal (global) music understanding tasks. Trained with a progressive training process and evaluated on MuchoMusic and the MusicBench benchmark, it reaches accuracies of 79.1% on MuchoMusic, 79.3% on MusicBench-Temporal, and 81.3% on MusicBench-Global, establishing a new state of the art in music understanding.
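To make the MoE idea concrete, here is a minimal sketch of a top-k softmax routing step, the standard mechanism in MoE layers. GaMMA's actual encoder design is not detailed in this section, so the expert functions, gate scores, and `top_k` value below are purely illustrative assumptions.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of gate logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_layer(x, experts, gate_logits, top_k=2):
    """Route input x to the top_k experts by gate probability and
    return the gate-weighted sum of their outputs."""
    probs = softmax(gate_logits)
    ranked = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize over selected experts
    return sum(probs[i] / norm * experts[i](x) for i in chosen)

# Toy experts standing in for, e.g., temporally- and globally-specialized
# transforms (hypothetical; not GaMMA's actual experts).
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x]
y = moe_layer(3.0, experts, gate_logits=[2.0, 1.0, -1.0], top_k=2)
```

The intuition is that the gate can learn to send frame-level (temporal) and clip-level (global) queries to different experts while sharing one encoder, which is one plausible reading of how a single MoE encoder could unify both task families.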