Zing Forum


ST-Path Survey: A Review of Multimodal Fusion Between Spatial Transcriptomics and Pathology

A systematic review study that comprehensively summarizes multimodal fusion technologies in the fields of spatial transcriptomics and pathology, proposes a three-layer classification system, and maps the technology development roadmap from 2018 to 2025.

Spatial transcriptomics · Pathology · Multimodal fusion · Foundation models · Biomedical AI · GitHub · Review
Published 2026-05-03 20:55 · Recent activity 2026-05-03 21:28 · Estimated read 11 min

Section 01

ST-Path Survey: Introduction to the Review of Multimodal Fusion Between Spatial Transcriptomics and Pathology

ST-Path Survey is a systematic review that comprehensively summarizes multimodal fusion technologies in the fields of spatial transcriptomics (ST) and pathology. It proposes a three-layer classification system (embedding layer, model layer, knowledge layer) and maps the technology development roadmap from 2018 to 2025. Maintained by ChlorineHi, the open-source project provides resources such as paper code, datasets, and a standardized evaluation framework, aiming to fill the field's lack of systematic organization and to offer a technical reference for researchers.


Section 02

Research Background and Significance

Research Background

Spatial transcriptomics (ST) can preserve tissue spatial information and measure gene expression, while pathology analyzes tissue morphological features through microscopic images. Their fusion enables a more comprehensive understanding of disease mechanisms (especially in cancer research). In recent years, deep learning has driven progress in multimodal fusion, but the field lacks systematic organization.

Project Significance

The ST-Path Survey project fills this gap and provides researchers with a comprehensive technical review and development roadmap.


Section 03

Detailed Explanation of the Three-Layer Classification System

Embedding Layer Fusion

Focuses on integration at the feature representation level:

  • Early fusion: Concatenate/transform raw data into a unified space—simple but prone to losing modal information;
  • Late fusion: Fusion at the decision layer after feature extraction from each modality—preserves modal specificity but lacks interaction information;
  • Middle fusion: Fusion at intermediate layers of feature extraction—balances the trade-offs of the other two and is the mainstream approach.

Representative methods: attention-based cross-modal alignment, contrastive learning for modal representations, autoencoders for a shared latent space.
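The three fusion levels can be contrasted with toy NumPy features. Everything here is an illustrative assumption (feature sizes, random projections, the softmax cross-attention), not the survey's reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy features: 4 spots, image patch features (8-d), gene expression (6-d)
img = rng.normal(size=(4, 8))
genes = rng.normal(size=(4, 6))

# Early fusion: concatenate raw features into one vector per spot
early = np.concatenate([img, genes], axis=1)           # shape (4, 14)

# Late fusion: score each modality separately, combine at the decision level
w_img, w_gen = rng.normal(size=8), rng.normal(size=6)  # toy linear "classifiers"
late = 0.5 * (img @ w_img) + 0.5 * (genes @ w_gen)     # shape (4,)

# Middle fusion: project both modalities into a shared latent space, then interact
P_img, P_gen = rng.normal(size=(8, 5)), rng.normal(size=(6, 5))
z_img, z_gen = img @ P_img, genes @ P_gen
attn = np.exp(z_img @ z_gen.T)
attn /= attn.sum(axis=1, keepdims=True)                # cross-modal attention weights
middle = z_img + attn @ z_gen                          # image latents enriched with gene context

print(early.shape, late.shape, middle.shape)
```

Note how early fusion discards modality boundaries entirely, late fusion only mixes scalar decisions, and middle fusion lets the two latent spaces interact while staying separate.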

Model Layer Fusion

Focuses on network architecture design:

  • Encoder-decoder: Independent encoders process modalities, and a shared decoder outputs results;
  • Transformer: Self-attention handles multimodal sequences (e.g., joint modeling of ViT image tokens and gene expression);
  • GNN: Model tissue slices as graphs to capture spatial dependencies;
  • Hybrid architecture: Combine the advantages of CNN, Transformer, and GNN.
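The encoder-decoder pattern above can be sketched minimally: independent one-layer encoders per modality, then a shared linear decoder over the combined latent. All shapes and weights are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def encoder(x, W):
    """Toy one-layer encoder: linear projection followed by ReLU."""
    return np.maximum(x @ W, 0.0)

# Hypothetical inputs: 10 spots with image features (16-d) and gene expression (12-d)
img, genes = rng.normal(size=(10, 16)), rng.normal(size=(10, 12))

# Independent encoders map each modality into the same 8-d latent space
h_img = encoder(img, rng.normal(size=(16, 8)))
h_gen = encoder(genes, rng.normal(size=(12, 8)))

# A shared decoder consumes the combined latent (here: elementwise sum)
# and predicts 3 hypothetical classes via softmax
W_dec = rng.normal(size=(8, 3))
logits = (h_img + h_gen) @ W_dec
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
print(probs.shape)  # one class distribution per spot
```

The design choice to share the decoder forces both modalities into a common latent semantics, which is what makes the fusion more than two parallel models.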

Knowledge Layer Fusion

Focuses on domain knowledge integration:

  • Prior knowledge embedding: Embed biological pathways and gene regulatory networks into models as graphs/constraints;
  • Causal reasoning: Infer causal relationships between gene expression and morphological features;
  • Interpretability: Attention visualization, feature attribution, etc.;
  • Knowledge graph integration: Pathological-genomic graphs support reasoning.
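Prior-knowledge embedding can be illustrated by masking a dense gene-gene weight matrix with a pathway adjacency graph, so information propagates only along known interactions. The 5-gene graph and all weights below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical prior: 5 genes, edges from a toy regulatory network
# (two disconnected modules: genes 0-2 and genes 3-4)
prior = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=float)

# Dense "learnable" gene-gene weights, as a model layer might hold
W = rng.normal(size=(5, 5))

# Knowledge-constrained layer: zero out interactions absent from the prior graph
W_constrained = W * prior

expr = rng.normal(size=(8, 5))        # 8 spots x 5 genes
out = expr @ W_constrained            # signal flows only along known pathways
print(out.shape)
```

The same masking idea extends to attention matrices or GNN edges: the biological graph constrains which interactions the model is even allowed to learn.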

Section 04

Technology Development Roadmap (2018-2025)

2018-2020: Rise of Representation Learning

Deep learning applications in single modalities:

  • 2018: DeepST uses CNNs to process ST data;
  • 2019: BERT inspires gene expression sequence modeling;
  • 2020: Self-supervised learning applied to pathology images.

Breakthroughs: spatial information encoding, gene expression dimensionality reduction, patch-based processing of pathology images.

2020-2022: Exploration of Multimodal Fusion

Systematic fusion of two modalities:

  • 2020: The first multimodal methods emerged;
  • 2021: Contrastive learning showed its potential;
  • 2022: Attention became the standard for cross-modal alignment.

Representative works: ST-Net, DeepSpaCE, HisToGene.

2022-2024: Era of Foundation Models

Dominance of large-scale pre-training:

  • 2022: CLIP inspired biomedical applications;
  • 2023: Pathology image foundation models (UNI, Prov-GigaPath) released;
  • 2024: ST foundation models emerged.

Trends: self-supervised pre-training becomes standard, model scale grows, multi-task capability improves.

2024-2025: Unification and Standardization

Establishment of unified frameworks and evaluation standards:

  • Construction of large-scale multi-center datasets;
  • Standardized benchmark testing;
  • Improvement of open-source tool ecosystems;
  • Acceleration of clinical translation.

Section 05

Key Technical Challenges

Data Heterogeneity

  • Resolution mismatch: pathology images offer (sub-)cellular pixel resolution, while gene expression is measured at a coarser spot level;
  • Data sparsity: ST data contain many zero counts, whereas pathology images are dense;
  • Scale difference: molecular level vs. cellular/tissue level.

Solutions: multi-scale feature pyramids, cross-resolution alignment, missing-data imputation.
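Cross-resolution alignment can be sketched by coordinate binning: assign each high-resolution image patch to the ST spot that covers it, then mean-pool patch features per spot. The 4×4 spot grid, patch counts, and feature sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: 200 image patches with pixel coordinates inside a
# 400x400 tissue region; ST spots tile the region on a 100-px grid (4x4 = 16 spots)
coords = rng.uniform(0, 400, size=(200, 2))
feats = rng.normal(size=(200, 16))
spot_size = 100.0

# Assign each patch to its covering spot by integer binning of its coordinates
bins = np.floor(coords / spot_size).astype(int)   # (200, 2) grid indices
keys = bins[:, 0] * 4 + bins[:, 1]                # flatten to spot ids 0..15

# Aggregate patch features per spot (mean pooling) to match ST resolution
spot_feats = np.zeros((16, 16))
counts = np.zeros(16)
np.add.at(spot_feats, keys, feats)                # unbuffered scatter-add per spot
np.add.at(counts, keys, 1)
spot_feats /= np.maximum(counts, 1)[:, None]
print(spot_feats.shape)  # one pooled feature vector per spot
```

Real pipelines replace mean pooling with learned aggregation (e.g., attention over patches), but the binning step above is the core of matching pixel-level imagery to spot-level expression.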

Interpretability Requirements

Biomedicine requires model interpretability:

  • Explain the reasons behind predictions;
  • Identify key genes and morphological features;
  • Discover new mechanisms.

Progress: attention visualization, feature attribution (SHAP, Integrated Gradients), concept activation vectors (CAV).
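Integrated Gradients, one of the attribution methods mentioned, can be sketched for a toy quadratic model whose gradient is known analytically; the model and inputs are hypothetical:

```python
import numpy as np

def model(x):
    """Toy differentiable model: quadratic score over a gene expression vector."""
    return (x ** 2).sum()

def integrated_gradients(x, baseline, steps=50):
    """Approximate IG: average the gradient along the straight path
    from baseline to input, then scale by (input - baseline)."""
    alphas = np.linspace(0, 1, steps)
    # Analytic gradient of sum(x^2) is 2x, evaluated at each path point
    grads = np.array([2 * (baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

x = np.array([1.0, -2.0, 0.5])
baseline = np.zeros(3)
attr = integrated_gradients(x, baseline)

# Completeness property: attributions sum to f(x) - f(baseline)
print(attr, attr.sum(), model(x) - model(baseline))
```

The completeness check is what makes IG attractive for biomedicine: every attributed gene's contribution accounts for an exact share of the prediction change.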

Data Privacy and Sharing

Sensitive medical data limits dataset construction:

  • Privacy regulations (HIPAA, GDPR);
  • Barriers to inter-institutional sharing;
  • Difficulty obtaining annotated data.

Responses: federated learning, synthetic data generation, transfer learning/domain adaptation.
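Federated learning can be sketched with a minimal FedAvg loop over synthetic "hospital" datasets; the linear model, client data, and hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def local_update(w, X, y, lr=0.1, epochs=5):
    """One client's local training: gradient descent on squared error.
    The raw data (X, y) never leaves the site; only weights are returned."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

# Hypothetical: 3 hospitals, each holding a private (X, y) dataset
clients = [(rng.normal(size=(20, 4)), rng.normal(size=20)) for _ in range(3)]
w_global = np.zeros(4)

for _ in range(10):                        # FedAvg communication rounds
    local = [local_update(w_global.copy(), X, y) for X, y in clients]
    w_global = np.mean(local, axis=0)      # server averages client weights

print(w_global)
```

Only model weights cross institutional boundaries, which is why this pattern is compatible with HIPAA/GDPR-style constraints on sharing patient-level data.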

Section 06

Application Scenarios and Clinical Value

Cancer Typing and Prognosis

  • Distinguish subtypes that are difficult to classify with traditional methods;
  • Predict prognosis and treatment response;
  • Discover new therapeutic targets.

Tumor Microenvironment Analysis

  • Immune cell infiltration patterns;
  • Tumor-stroma boundary features;
  • Quantification of spatial heterogeneity.

Drug Response Prediction

  • Chemotherapy sensitivity prediction;
  • Immune therapy response assessment;
  • Drug resistance mechanism research.

Section 07

Future Development Directions and Summary

Future Development Directions

  • Large-scale pre-training: Integration of millions of slice data, optimization of self-supervised strategies;
  • Multimodal foundation models: Process images/genes/text, zero-shot learning, cross-cancer generalization;
  • Causal reasoning: Causal inference between genes and morphology, treatment mechanism modeling;
  • Clinical translation: Real-time analysis systems, regulatory approval, workflow integration.

Summary

ST-Path Survey provides a systematic review for the field. The three-layer classification system and roadmap help researchers clarify the technical context. It is an important reference for researchers in computational pathology, bioinformatics, and medical AI, and the field is expected to play a greater role in precision medicine.