Fusion Mamba: An Interpretable Multimodal Framework for Mild Cognitive Impairment Detection

An interpretable framework built on the Mamba state space model and cross-modal attention fusion. It automatically detects mild cognitive impairment (MCI) by analyzing linguistic disfluencies and acoustic biomarkers in spontaneous speech, and performs strongly across multiple clinical datasets.

Tags: MCI Detection · Mamba · Multimodal Fusion · Speech Recognition · Cognitive Impairment · Medical AI · Explainable AI · Whisper · eGeMAPS · Alzheimer's Disease
Published 2026-03-30 15:04 · Recent activity 2026-03-30 15:29 · Estimated read: 7 min

Section 01

Introduction to the Fusion Mamba Framework: An Interpretable Multimodal Solution for MCI Detection

Fusion Mamba is an interpretable framework built on the Mamba state space model and cross-modal attention fusion. It automatically detects mild cognitive impairment (MCI) by analyzing linguistic disfluencies and acoustic biomarkers in spontaneous speech. The framework performs strongly on multiple clinical datasets, balancing detection accuracy with interpretability, and offers an efficient, reliable AI solution for early MCI screening.


Section 02

Research Background and Challenges: Urgent Need for MCI Detection and Existing Problems

The acceleration of global aging has made dementias such as Alzheimer's disease a major public health challenge, and early detection of MCI (the transitional stage between normal aging and dementia) is crucial. Traditional clinical assessments rely on subjective judgment, are costly, and are difficult to scale. Speech-based MCI detection faces three major challenges: data scarcity (high-quality clinical speech datasets are small), poor cross-domain generalization (models transfer poorly across datasets, languages, and collection conditions), and limited interpretability (medical AI must earn clinical trust).


Section 03

Core Methods of the Fusion Mamba Framework: Multimodal Fusion and Technical Innovations

The framework takes dual-modal input: the language modality is transcribed text extracted with Whisper Large-v3, and the acoustic modality is the 88-dimensional eGeMAPS feature set extracted with openSMILE. Key innovations:

1. Mamba as the language encoder: the pre-trained Mamba-130M backbone is frozen and only the classification layer is trained, avoiding overfitting on small datasets.
2. Cross-modal attention fusion: projected features are concatenated, with attention weights providing both dynamic fusion and interpretability.
3. Hallucination filtering: ASR output is cleaned via triple-loop detection, a unique-token-ratio threshold, and WER verification.
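The hallucination filtering step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the loop detection and unique-token-ratio checks are implemented with illustrative thresholds, while the WER check (which requires a reference transcript) is noted only as a comment.

```python
def unique_token_ratio(text: str) -> float:
    """Fraction of distinct tokens; low values signal repetitive ASR output."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def has_triple_loop(text: str, max_n: int = 4) -> bool:
    """True if any n-gram (n <= max_n) occurs three or more times in a row."""
    tokens = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - 3 * n + 1):
            gram = tokens[i:i + n]
            if tokens[i + n:i + 2 * n] == gram and tokens[i + 2 * n:i + 3 * n] == gram:
                return True
    return False

def is_hallucinated(transcript: str, min_unique_ratio: float = 0.3) -> bool:
    """Flag a transcript for removal before it reaches the classifier."""
    if not transcript.strip():
        return True
    if has_triple_loop(transcript):
        return True
    if unique_token_ratio(transcript) < min_unique_ratio:
        return True
    # A third check would verify WER against a reference transcript when available.
    return False
```

Looping output such as "thank you thank you thank you" is a well-known Whisper failure mode on silence or noise, which is why repeat detection comes first.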


Section 04

Experimental Results and Key Findings: Cross-Dataset Performance and Modal Contribution Analysis

The framework was evaluated on three datasets: Pitt, ADReSS 2020, and TAUKADIAL. Under the unified pooled-training strategy, weighted F1 scores reached 0.946, 0.974, and 0.919 respectively. Single-source transfer degraded performance sharply (for example, a model trained on ADReSS scored only 0.432-0.520 F1 on TAUKADIAL). Modality-contribution analysis shows that multimodal fusion mainly improves interpretability rather than accuracy: language features received an average attention weight of 88.1%, while acoustic features such as jitter, shimmer, and HNR correlated strongly with MCI.
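For readers unfamiliar with the reported metric, weighted F1 averages per-class F1 scores with weights proportional to each class's true support. A minimal pure-Python sketch, using hypothetical labels (0 = control, 1 = MCI) rather than the paper's data:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged by each class's share of the true labels."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (n / total) * f1
    return score
```

Unlike plain accuracy, this metric remains informative when control and MCI classes are imbalanced, which is common in small clinical datasets.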


Section 05

Interpretability Analysis Suite: Transparent Decision Support Tools

The framework provides a complete set of interpretability tools: modal weight visualization (showing each modality's contribution per sample), word-level perturbation analysis (identifying key cognitive marker words), feature-category perturbation (ranking the importance of acoustic feature categories), and FDR-corrected biomarker testing (statistical tests for significantly correlated acoustic features). These tools help explain model behavior and support clinical decision-making.
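The word-level perturbation idea can be sketched as below: each word is masked in turn and the drop in the model's MCI score is recorded. The scoring function here is a toy stand-in (filler-word fraction) for the real classifier, so the numbers are purely illustrative.

```python
def mci_score(text: str) -> float:
    """Toy scorer: fraction of filler tokens, standing in for the real model."""
    fillers = {"um", "uh", "er"}
    tokens = text.lower().split()
    return sum(t in fillers for t in tokens) / len(tokens) if tokens else 0.0

def word_importance(text: str, score_fn=mci_score):
    """Score drop when each word is removed; positive = word pushed toward MCI."""
    tokens = text.split()
    base = score_fn(text)
    importances = {}
    for i, word in enumerate(tokens):
        perturbed = " ".join(tokens[:i] + tokens[i + 1:])
        importances[word] = base - score_fn(perturbed)
    return importances
```

With a real classifier in place of `mci_score`, the highest-importance words would surface the disfluencies and hesitations the model treats as cognitive markers.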


Section 06

Clinical Significance and Application Prospects: Transformation from Technology to Medical Scenarios

The framework demonstrates that speech-based automatic MCI detection is feasible and approaches expert-level performance, and its interpretability design aligns with regulatory requirements for medical AI. Potential application scenarios include community cognitive screening for older adults, auxiliary diagnosis in primary care, and longitudinal monitoring of cognitive decline. Early detection of MCI allows timely intervention to delay the progression of dementia.


Section 07

Limitations and Future Directions: Paths for Expansion and Optimization

Research limitations include small dataset sizes, coverage limited mainly to English- and Mandarin-speaking populations, and support for binary classification only. Future directions include introducing additional modalities (facial expressions, eye movements), developing lightweight models for edge deployment, building large-scale cross-language datasets, and expanding to multi-stage classification (normal / MCI / different dementia stages).


Section 08

Summary: Value and Contributions of the Fusion Mamba Framework

Fusion Mamba combines the efficient sequence modeling capability of Mamba with the interpretability advantages of cross-modal attention fusion, and has shown excellent performance on clinical datasets. Its core value lies in providing trustworthy explanations and evidence support for medical AI, promoting progress in the field of speech cognitive assessment, and facilitating early MCI detection and intervention.