AudioMCQ: A New Milestone in Advancing Post-Training for Large Audio Language Models

AudioMCQ is an audio multiple-choice question dataset with 571,000 samples, designed specifically for the post-training of Large Audio Language Models (LALMs). Through its dual-chain thinking annotation and audio contribution filtering mechanisms, models post-trained on the dataset achieve state-of-the-art performance on audio understanding tasks, and the approach took first place in the DCASE 2025 Challenge.

Tags: AudioMCQ, Audio Language Models, Multimodal Learning, DCASE 2025, Chain-of-Thought, Audio Understanding, Dataset, Post-training, ICLR 2026
Published 2026-04-13 15:13 · Recent activity 2026-04-13 15:18 · Estimated read 7 min

Section 01

AudioMCQ: A New Milestone in Advancing Post-Training for Large Audio Language Models

AudioMCQ is a large-scale multiple-choice question dataset designed specifically for the post-training of Large Audio Language Models (LALMs), containing 571,000 samples. Its core innovations, a dual-chain thinking annotation mechanism and an audio contribution filtering framework, directly address the problem of models over-relying on text priors. The approach won first place in the DCASE 2025 Challenge, fills the gap in audio contribution-aware datasets, and advances the development of audio language models.

Section 02

Background: Core Challenges Faced by Audio Language Models

With the development of multimodal large language models, audio understanding has become an important dimension of comprehensive intelligence. However, existing models tend to over-rely on prior knowledge carried in text prompts when answering questions about audio, rather than genuinely understanding the audio content. This reliance on spurious correlations limits their practical value. To address this, the inclusionAI team proposed the AudioMCQ dataset, introducing an "audio contribution-aware" training paradigm aimed at building systems with real audio understanding capabilities.

Section 03

Core Design Features of the AudioMCQ Dataset

Scale and Coverage

AudioMCQ contains 571,000 samples covering four major domains: sound, music, speech, and time series. The multiple-choice format enables automated evaluation while still probing fine-grained understanding.
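
To make the format concrete, here is a minimal sketch of what one record might look like; the field names and values are illustrative, not the dataset's actual schema.

```python
# Hypothetical AudioMCQ-style record (illustrative schema, not the
# dataset's actual field names).
sample = {
    "audio_path": "audio/dog_bark_0001.wav",
    "domain": "sound",  # one of: sound | music | speech | time series
    "question": "What is the most likely source of the repeated sound?",
    "choices": {"A": "A barking dog", "B": "A car horn",
                "C": "A door knock", "D": "A human cough"},
    "answer": "A",
}
```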

Dual-Chain Thinking Annotation

Each sample carries two reasoning chains: a structured one (explicit logical steps with intermediate conclusions) and an unstructured one (natural, free-form reasoning), helping models learn both systematic decomposition and flexible thinking.
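
Continuing the illustrative schema above, the two chains might be attached to a sample like this; the field names and wording are assumptions, not the paper's released format.

```python
# Hypothetical dual-chain annotation for one sample (field names assumed).
annotation = {
    # Structured chain: explicit steps with intermediate conclusions.
    "structured_cot": [
        "Step 1: Identify the salient acoustic events (short repeated bursts).",
        "Step 2: Compare their timbre against the four options.",
        "Conclusion: the bursts resemble a dog bark, so the answer is A.",
    ],
    # Unstructured chain: natural, free-form reasoning.
    "unstructured_cot": (
        "The clip has sharp rhythmic bursts with a harmonic tail, which "
        "sounds like an animal vocalization rather than a mechanical "
        "source, so a barking dog (A) fits best."
    ),
}
```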

Audio Contribution Filtering

Samples are divided into two categories: weak contribution (54.8%, answerable with text alone) and strong contribution (45.2%, requiring deep audio understanding), guiding models to balance the weight of audio and text information.
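
One plausible way to implement such filtering is to query a text-only model without the audio and label samples it already answers correctly as weak contribution. The sketch below assumes a hypothetical `answer_text_only(question, choices)` callable; the paper's actual filtering recipe may differ.

```python
def label_contribution(sample: dict, answer_text_only, n_trials: int = 3) -> str:
    """Label a sample 'weak' if a text-only model usually answers it
    correctly without hearing the audio, otherwise 'strong'.

    `answer_text_only` is a hypothetical callable returning an option
    letter such as "A"; majority voting over n_trials reduces the chance
    of labeling a lucky guess as weak contribution.
    """
    hits = sum(
        answer_text_only(sample["question"], sample["choices"]) == sample["answer"]
        for _ in range(n_trials)
    )
    return "weak" if hits > n_trials // 2 else "strong"
```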

Section 04

Innovative Training Paradigms and Evaluation Metrics

Training Strategies

  • Weak-to-Strong Paradigm: train first on weak-contribution samples, then transition to strong-contribution samples, avoiding "shortcut learning."
  • Mixed-to-Strong Paradigm: mix the two sample types and assign higher loss weights to strong-contribution samples, balancing training stability and deep understanding (see the sketch after this list).
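
The following is a minimal sketch of the Mixed-to-Strong idea as a per-sample weighted cross-entropy; the weight value and weighting scheme are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def weighted_mcq_loss(logits: torch.Tensor, targets: torch.Tensor,
                      is_strong: torch.Tensor,
                      strong_weight: float = 2.0) -> torch.Tensor:
    """Cross-entropy over answer choices, up-weighting strong-contribution
    samples.

    logits:    [batch, num_choices] scores over the options
    targets:   [batch] index of the correct option
    is_strong: [batch] bool mask, True for strong-contribution samples
    """
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.where(is_strong,
                          torch.full_like(per_sample, strong_weight),
                          torch.ones_like(per_sample))
    return (weights * per_sample).mean()
```

Under this framing, the Weak-to-Strong paradigm instead orders the training data (weak-contribution batches first), with no reweighting needed.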

Innovation in Evaluation Metrics

The evaluation introduces the MMAR (multimodal audio reasoning) and MMAU (multimodal audio understanding) benchmarks to assess the soundness of the model's decision-making process and to more faithfully reflect the depth of audio understanding.
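
For multiple-choice benchmarks like these, the headline metric is typically option accuracy. A minimal scorer might look like the sketch below; the option-extraction regex is an assumption, and real evaluation scripts parse model output more strictly.

```python
import re

def mcq_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of model responses whose extracted option letter matches
    the gold answer. Assumes four options labeled A-D."""
    correct = 0
    for pred, gold in zip(predictions, answers):
        match = re.search(r"\b([A-D])\b", pred)
        if match and match.group(1) == gold:
            correct += 1
    return correct / len(answers)
```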

Section 05

Experimental Results: DCASE 2025 Champion and Model Performance Improvement

  • Competition Results: the system trained on AudioMCQ won first place in the DCASE 2025 Audio Question Answering Challenge, verifying the dataset's effectiveness in practice.
  • Model Improvement: Models post-trained with AudioMCQ show significant improvements in robustness and accuracy in complex audio understanding scenarios. The team open-sourced model checkpoints for the Weak-to-Strong and Mixed-to-Strong paradigms.
  • Community Feedback: In April 2026, the evaluation script was revised to ensure the accuracy of the reported MMSU scores. The AudioMCQ-StrongAC-GeminiCoT subset (with CoT generated by Gemini 3.1 Pro) was released and designated as the official training data for DCASE 2026 Task 5.

Section 06

Application Prospects and Academic Value

  • Advancing LALM Development: fills the gap in large-scale audio contribution-aware datasets; as a standardized resource, it accelerates progress across the field.
  • Inspiration for Multimodal Fusion: The concept of audio contribution can be extended to visual, tactile, and other modalities, helping build balanced and reliable multimodal systems.
  • Industrial Applications: Models can be applied to scenarios such as intelligent customer service, audio auditing, medical auscultation, and industrial fault detection, enhancing application value in vertical fields.

Section 07

Conclusion: Milestone Significance and Future Outlook of AudioMCQ

AudioMCQ is an important milestone in the construction of training data for audio language models. It establishes a new training paradigm through audio contribution awareness and dual-chain annotation, guiding models toward deep audio understanding. Its acceptance at ICLR 2026 and victory at DCASE 2025 demonstrate both its academic value and its practical potential. Subsequent versions (such as StrongAC-GeminiCoT) and its adoption in DCASE 2026 will continue to drive progress in the field. Researchers and engineers who engage deeply with its ideas will be well positioned to follow where audio language models are heading.