CMML: A Context-Driven Missing-Modality Learning Framework for Robust Medical Diagnosis

This article introduces the CMML framework, which handles missing modalities in multimodal medical diagnosis using a cascaded residual Transformer autoencoder and learnable context tokens. The source paper reports that it outperforms state-of-the-art methods on three datasets: skin lesions (Derm7pt), eye diseases (ODIR), and meningiomas (MEN).

Multimodal learning, Missing modality, Medical diagnosis, Transformer, Contrastive learning, Autoencoder, Skin lesions, Fundus diseases

Published 2026-05-25 23:44 | Recent activity 2026-05-26 14:51 | Estimated read 6 min

Section 01

Introduction: CMML Framework Empowers Robust Medical Diagnosis

This article introduces the Context-driven Missing-Modality Learning (CMML) framework, which addresses missing modalities in medical diagnosis through two main designs: the Cascaded Residual Transformer Autoencoder (CRTA) and learnable context tokens. According to the source paper, the framework outperforms state-of-the-art methods on three datasets: skin lesions (Derm7pt), eye diseases (ODIR), and meningiomas (MEN).

Original Authors and Source

Original Author/Maintainer: arXiv authors
Source Platform: arXiv
Original Title: Context-driven Missing-Modality Learning for Robust Medical Diagnosis with Image-Tabular Data
Original Link: http://arxiv.org/abs/2605.25968v1
Source Publication/Update Time: 2026-05-25T15:44:26Z

Section 02

Dilemma of Missing Modalities in Medical Diagnosis and Limitations of Existing Methods

In modern medical practice, fusing multimodal data (medical images plus tabular clinical records) can improve diagnostic accuracy. In reality, however, one or more modalities are often unavailable for a given patient, in no fixed pattern, because of equipment availability, cost, or patient compliance.

Limitations of existing methods:

Directly discarding missing modalities: Loses valuable information and reduces diagnostic accuracy;
Simple interpolation or synthesis: Fails to capture complex dependencies between modalities, leading to low synthesis quality;
Modality-agnostic representation learning: Sacrifices modality specificity and lacks robustness.

Section 03

CMML Framework: Two-Stage Processing Flow

The core idea of the CMML framework is to use the overall semantic information of the dataset to guide missing modality synthesis and cross-modal alignment, adopting a two-stage strategy:

Modality Synthesis Stage: Synthesize representations of missing modalities;
Semantic Alignment Stage: Align all modality representations to a unified space.

This sequential design simplifies optimization and lets each stage focus on its own task.
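
The two-stage flow can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: `synthesize_missing` and `align` are hypothetical stand-ins, the synthesis step simply fills a missing modality from a dataset-level mean vector, and the alignment step is a fixed linear projection followed by L2 normalization. The feature sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # hypothetical feature size shared by both modalities

def synthesize_missing(image_feat, tabular_feat, dataset_mean):
    """Stage 1 (placeholder): fill a missing modality with a dataset-level prior."""
    if image_feat is None:
        image_feat = dataset_mean["image"]
    if tabular_feat is None:
        tabular_feat = dataset_mean["tabular"]
    return image_feat, tabular_feat

def align(image_feat, tabular_feat, proj_image, proj_tabular):
    """Stage 2 (placeholder): project both modalities into one space and normalize."""
    z_img = proj_image @ image_feat
    z_tab = proj_tabular @ tabular_feat
    return z_img / np.linalg.norm(z_img), z_tab / np.linalg.norm(z_tab)

# Hypothetical dataset-level statistics and projection matrices.
dataset_mean = {"image": rng.normal(size=DIM), "tabular": rng.normal(size=DIM)}
proj_image = rng.normal(size=(4, DIM))
proj_tabular = rng.normal(size=(4, DIM))

# A patient whose tabular record is missing (None).
image_feat, tabular_feat = synthesize_missing(rng.normal(size=DIM), None, dataset_mean)
z_img, z_tab = align(image_feat, tabular_feat, proj_image, proj_tabular)
fused = np.concatenate([z_img, z_tab])  # would feed a downstream classifier
print(fused.shape)  # (8,)
```

Keeping the two steps as separate functions mirrors the point made above: the synthesis step never needs to know how alignment works, and the alignment step always receives a complete set of modalities.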

Section 04

CRTA Component: Innovative Design of Cascaded Residual Transformer Autoencoder

The core component for modality synthesis is the Cascaded Residual Transformer Autoencoder (CRTA), whose key features include:

Learnable Context Tokens: Serve as dataset-level semantic priors, interact with available modalities via attention mechanisms to infer missing modality features;
Cascaded Residual Structure: Gradually refines features, and residual connections ensure effective gradient propagation;
Modality-Specific Memory Bank: Stores typical modality patterns to provide references for synthesis.
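
To make the first two features concrete, here is a minimal NumPy sketch of context tokens attending to an available modality through a cascade of residual attention blocks. It is an assumption-laden illustration rather than the CRTA architecture itself: the sizes `D`, `N_CTX`, and `N_BLOCKS` are made up, the weights are random instead of learned, attention is single-head, and the autoencoder reconstruction objective and the memory bank are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16        # feature width (assumed)
N_CTX = 4     # number of context tokens (assumed)
N_BLOCKS = 3  # depth of the cascade (assumed)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, w_q, w_k, w_v):
    """Single-head scaled dot-product attention: queries attend to keys_values."""
    q, k, v = queries @ w_q, keys_values @ w_k, keys_values @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

# Randomly initialized here; in training these would be learned parameters.
context_tokens = rng.normal(scale=0.1, size=(N_CTX, D))
blocks = [tuple(rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3))
          for _ in range(N_BLOCKS)]

def synthesize(available_tokens):
    """Estimate missing-modality features from the available modality."""
    h = context_tokens
    for w_q, w_k, w_v in blocks:
        # Context tokens read from themselves and from the available modality.
        memory = np.concatenate([h, available_tokens], axis=0)
        # Residual update: each block refines the previous estimate.
        h = h + cross_attention(h, memory, w_q, w_k, w_v)
    return h.mean(axis=0)  # pooled estimate of the missing modality's features

image_tokens = rng.normal(size=(10, D))  # e.g. patch features of an available image
tabular_estimate = synthesize(image_tokens)
print(tabular_estimate.shape)  # (16,)
```

Because the context tokens are shared across all patients, they can carry dataset-level information, while the attention step conditions the estimate on whatever modality a particular patient actually has.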

Section 05

Instance-Adaptive Semantic Alignment: Unifying Multimodal Representation Space

After the missing modalities are synthesized, the heterogeneous representations must be unified in a shared semantic space:

Instance-Adaptive Semantic Reference: Inject multimodal representations output by CRTA into context tokens, converting them into patient-specific knowledge as alignment guidance;
Category-Aware Contrastive Refinement: Through contrastive learning, similar samples are brought closer while dissimilar ones are kept apart, enhancing the discriminability of representations.
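
The contrastive idea can be illustrated with a standard supervised contrastive loss, in which samples sharing a class label act as positives for one another. This is a generic formulation chosen for illustration and may differ from the loss used in the paper; the function name, the temperature value, and the toy embeddings are all assumptions. The instance-adaptive reference is not modeled here.

```python
import numpy as np

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Mean supervised contrastive loss over anchors that have at least one positive."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self

    # Log-softmax over all other samples (self-similarity excluded), computed stably.
    sim = np.where(not_self, sim, -np.inf)
    sim_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - sim_max - np.log(np.exp(sim - sim_max).sum(axis=1, keepdims=True))

    pos_count = positives.sum(axis=1)
    has_pos = pos_count > 0
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[has_pos] / pos_count[has_pos]).mean())

labels = np.array([0, 0, 1, 1])
# Same-class samples close together: low loss.
clustered = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
# Same-class samples far apart: high loss.
mixed = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])

loss_clustered = supervised_contrastive_loss(clustered, labels)
loss_mixed = supervised_contrastive_loss(mixed, labels)
print(loss_clustered < loss_mixed)  # True
```

Minimizing a loss of this shape pulls same-class representations together and pushes different-class ones apart, which is the behavior the refinement step is described as encouraging.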

Section 06

Experimental Validation: Performance Improvement on Three Medical Datasets

The authors evaluated CMML on three datasets and report the following gains:

Derm7pt (Skin Lesions): 1.26% increase in average AUC;
ODIR (Eye Diseases): 0.97% increase in AUC;
MEN (Meningioma Grading): 1.32% performance improvement.

The gains are consistent across all three datasets. In a medical setting, an improvement of around 1% can still be clinically meaningful.
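
For readers less familiar with the metric, AUC can be read as the fraction of positive/negative pairs that a model ranks in the right order. The sketch below computes it directly from that definition. The scores and labels are made-up toy values, not results from the paper, and `auc` is a hypothetical helper written for this illustration.

```python
def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0, 0]
baseline = [0.9, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2]
# Same model outputs, except the third positive case now scores above one more negative.
improved = [0.9, 0.6, 0.55, 0.7, 0.5, 0.3, 0.2]

auc_baseline = auc(baseline, labels)
auc_improved = auc(improved, labels)
print(auc_baseline, auc_improved)
```

In this toy case, fixing a single misordered pair out of twelve moves AUC from 0.75 to about 0.83. On real test sets with many more pairs, a gain of about one point corresponds to a much larger number of cases ranked correctly.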

Section 07

Technical Insights and Future Directions

Technical insights from CMML:

Learnable context tokens demonstrate the value of dataset-level semantic priors;
The phased strategy simplifies optimization of complex tasks;
Instance adaptation connects global patterns with local features;
Category-aware contrastive learning enhances representation discriminability.

Future directions: extend the framework to more modalities (genomics, electronic medical record text) and apply it to other fields such as autonomous driving and multi-sensor fusion.
