Zing Forum

CMML: A Context-Driven Missing-Modality Learning Framework for Robust Medical Diagnosis

This article introduces the CMML framework, which handles missing modalities in multimodal medical diagnosis using a cascaded residual Transformer autoencoder and learnable context tokens. It is reported to outperform state-of-the-art methods on three datasets: skin lesions (Derm7pt), eye diseases (ODIR), and meningiomas (MEN).

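Tags: multimodal learning, missing modality, medical diagnosis, Transformer, contrastive learning, autoencoder, skin lesions, fundus diseases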
Published 2026-05-25 23:44 | Recent activity 2026-05-26 14:51 | Estimated read 6 min

Section 01

Introduction: CMML Framework Empowers Robust Medical Diagnosis

This article introduces the Context-driven Missing-Modality Learning (CMML) framework, which tackles missing modalities in medical diagnosis by combining a Cascaded Residual Transformer Autoencoder (CRTA) with learnable context tokens. The framework is reported to outperform state-of-the-art methods on three datasets: skin lesions (Derm7pt), eye diseases (ODIR), and meningiomas (MEN).

Original Authors and Source

  • Original Author/Maintainer: arXiv authors
  • Source Platform: arXiv
  • Original Title: Context-driven Missing-Modality Learning for Robust Medical Diagnosis with Image-Tabular Data
  • Original Link: http://arxiv.org/abs/2605.25968v1
  • Source Publication/Update Time: 2026-05-25T15:44:26Z

Section 02

Dilemma of Missing Modalities in Medical Diagnosis and Limitations of Existing Methods

In modern medical practice, fusing multimodal data (medical images plus tabular clinical records) can improve diagnostic accuracy. In practice, however, modalities are often missing at random because of equipment availability, cost, and patient compliance.

Limitations of existing methods:

  1. Directly discarding missing modalities: Loses valuable information and reduces diagnostic accuracy;
  2. Simple imputation or synthesis: Fails to capture complex dependencies between modalities, so the synthesized data is of low quality;
  3. Modality-agnostic representation learning: Sacrifices modality-specific information and lacks robustness.
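
To make the first limitation concrete, here is a minimal sketch of how random missingness shrinks the pool of complete cases. The function name `mask_modalities`, the 30% missing rate, and the rule that each sample keeps at least one modality are assumptions for illustration only; the source does not describe the paper's actual masking protocol.

```python
import random

def mask_modalities(n_samples, missing_rate, seed=0):
    """Return (has_image, has_tabular) flags for each sample.

    Assumption: each modality is dropped independently with probability
    `missing_rate`, and every sample keeps at least one modality.
    """
    rng = random.Random(seed)
    masks = []
    for _ in range(n_samples):
        has_image = rng.random() >= missing_rate
        has_tabular = rng.random() >= missing_rate
        if not (has_image or has_tabular):
            # Restore one modality at random so the sample stays usable.
            if rng.random() < 0.5:
                has_image = True
            else:
                has_tabular = True
        masks.append((has_image, has_tabular))
    return masks

masks = mask_modalities(n_samples=1000, missing_rate=0.3)
complete = sum(1 for has_image, has_tabular in masks if has_image and has_tabular)
print(f"complete cases: {complete} / {len(masks)}")
```

Under these assumptions, a pipeline that only accepts complete cases would ignore roughly half of the samples, which is the information loss the first limitation refers to.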

Section 03

CMML Framework: Two-Stage Processing Flow

The core idea of the CMML framework is to use dataset-level semantic information to guide both missing-modality synthesis and cross-modal alignment. It does so in two stages:

  1. Modality Synthesis Stage: Synthesize representations of missing modalities;
  2. Semantic Alignment Stage: Align all modality representations to a unified space.

This sequential design makes optimization easier and lets each stage focus on its own task.
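
The two-stage flow can be sketched as two functions applied in sequence. Everything here is a hypothetical stand-in: `synthesize_missing`, `align`, `mean_fill`, and `project_mean` are illustrative names, and the mean-based fill and projection only mark where CMML's learned components would sit.

```python
def synthesize_missing(sample, synthesizers):
    """Stage 1: fill in every modality whose features are None."""
    available = {name: feats for name, feats in sample.items() if feats is not None}
    filled = dict(available)
    for name, feats in sample.items():
        if feats is None:
            # The synthesizer only sees the modalities that are present.
            filled[name] = synthesizers[name](available)
    return filled

def align(filled, projections):
    """Stage 2: map every modality into one shared space."""
    return {name: projections[name](feats) for name, feats in filled.items()}

def mean_fill(dim):
    """Toy synthesizer: repeat the mean of all available features."""
    def fill(available):
        values = [v for feats in available.values() for v in feats]
        return [sum(values) / len(values)] * dim
    return fill

def project_mean(feats):
    """Toy projection into a 1-D shared space."""
    return [sum(feats) / len(feats)]

synthesizers = {"image": mean_fill(3), "tabular": mean_fill(2)}
projections = {"image": project_mean, "tabular": project_mean}

sample = {"image": [0.2, 0.4, 0.6], "tabular": None}  # tabular data is missing
filled = synthesize_missing(sample, synthesizers)
aligned = align(filled, projections)
print(filled)
print(aligned)
```

Keeping the stages as separate functions mirrors the point made above: each one can be trained and debugged against its own objective.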


Section 04

CRTA Component: Innovative Design of Cascaded Residual Transformer Autoencoder

The core component for modality synthesis is the Cascaded Residual Transformer Autoencoder (CRTA). Its key features are:

  1. Learnable Context Tokens: Serve as dataset-level semantic priors and interact with the available modalities through attention to infer the features of missing modalities;
  2. Cascaded Residual Structure: Refines features stage by stage, with residual connections that keep gradients propagating effectively;
  3. Modality-Specific Memory Bank: Stores typical patterns for each modality as references for synthesis.
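
The interaction between context tokens and available modalities can be illustrated with a toy example. This is not the CRTA architecture: it is a sketch under strong assumptions (single-head attention with identity projections, no encoder or decoder, no memory bank, hand-picked token values instead of learned ones). The names `cross_attend` and `cascaded_residual_synthesis` are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    """Single-head scaled dot-product attention with identity projections."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        outputs.append([sum(w * kv[i] for w, kv in zip(weights, keys_values))
                        for i in range(dim)])
    return outputs

def cascaded_residual_synthesis(context_tokens, available_tokens, n_stages=3):
    """Refine the context tokens over several stages.

    Each stage attends to the available modality and adds the result
    back onto the current state (a residual update).
    """
    state = [list(t) for t in context_tokens]
    for _ in range(n_stages):
        update = cross_attend(state, available_tokens)
        state = [[s + u for s, u in zip(s_tok, u_tok)]
                 for s_tok, u_tok in zip(state, update)]
    return state

# In a real model the context tokens would be learned parameters.
context_tokens = [[0.1, 0.0], [0.0, 0.1]]
image_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # the available modality
synthesized = cascaded_residual_synthesis(context_tokens, image_tokens)
print(synthesized)
```

The residual form means each stage only has to learn a correction to the previous estimate, which is one common motivation for cascaded designs.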

Section 05

Instance-Adaptive Semantic Alignment: Unifying Multimodal Representation Space

After the missing modalities are synthesized, the heterogeneous representations still have to be brought into a shared semantic space. CMML does this in two steps:

  1. Instance-Adaptive Semantic Reference: Injects the multimodal representations output by CRTA into the context tokens, turning them into patient-specific knowledge that guides alignment;
  2. Category-Aware Contrastive Refinement: Uses contrastive learning to pull samples of the same category together and push samples of different categories apart, making the representations more discriminative.
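
A category-aware contrastive objective can be sketched with the widely used supervised contrastive loss. This is a stand-in chosen for illustration; the source does not give CMML's exact loss. The temperature of 0.1 and the 2-D toy embeddings are assumptions.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Mean loss over anchors that have at least one same-label positive."""
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, n_anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other sample.
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        max_sim = max(sims.values())
        log_denom = max_sim + math.log(
            sum(math.exp(s - max_sim) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0

embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
tight = supervised_contrastive_loss(embeddings, [0, 0, 1, 1])  # classes form clusters
mixed = supervised_contrastive_loss(embeddings, [0, 1, 0, 1])  # classes are interleaved
print(f"clustered: {tight:.4f}, interleaved: {mixed:.4f}")
```

The loss is lower when same-category embeddings already sit close together, so minimizing it moves the representation toward the clustered arrangement.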

Section 06

Experimental Validation: Performance Improvement on Three Medical Datasets

Researchers validated the effectiveness of CMML on three datasets:

  • Derm7pt (Skin Lesions): 1.26% increase in average AUC;
  • ODIR (Eye Diseases): 0.97% increase in AUC;
  • MEN (Meningioma Grading): 1.32% performance improvement.

CMML improved consistently on all three datasets. The article argues that in medical diagnosis even a gain of about 1% can be clinically meaningful.
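
For readers unfamiliar with the metric, the sketch below computes a binary ROC AUC by pairwise comparison and reports a gain in percentage points. The labels and scores are made up for illustration and have no connection to the paper's results. The datasets above may also be multi-class, in which case an averaged AUC would be computed differently.

```python
def roc_auc(labels, scores):
    """Probability that a random positive scores above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data, not taken from the paper.
labels = [1, 0, 1, 0, 1, 0]
baseline_scores = [0.9, 0.8, 0.7, 0.3, 0.4, 0.2]
improved_scores = [0.9, 0.5, 0.7, 0.3, 0.6, 0.2]

baseline_auc = roc_auc(labels, baseline_scores)
improved_auc = roc_auc(labels, improved_scores)
gain_points = 100 * (improved_auc - baseline_auc)
print(f"gain: {gain_points:.2f} percentage points")
```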


Section 07

Technical Insights and Future Directions

Technical insights from CMML:

  1. Learnable context tokens demonstrate the value of dataset-level semantic priors;
  2. The phased strategy simplifies optimization of complex tasks;
  3. Instance adaptation connects global patterns with local features;
  4. Category-aware contrastive learning enhances representation discriminability.

Future directions: Extend the framework to more modalities (genomics, electronic medical record text) and apply it to other domains that rely on multi-sensor fusion, such as autonomous driving.