Zing Forum

MNDGNN: A Method for Identifying Cancer Driver Genes Based on Multiplex Networks and Directed Graph Neural Networks

This article introduces MNDGNN, a directed graph neural network built on multiplex networks. It addresses label scarcity and class imbalance in cancer driver gene identification by integrating multi-omics data with data augmentation techniques.

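Tags: MNDGNN, cancer driver genes, graph neural networks, multiplex networks, multi-omics, precision medicine, bioinformatics, deep learning, data augmentation
Published 2026-04-30 16:15 | Recent activity 2026-04-30 16:22 | Estimated read 9 min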

Section 02

Introduction: Core Challenges in Precision Oncology

Identifying cancer driver genes is fundamental to precision oncology research and clinical practice. These genes play a key role in tumor initiation and progression and are important targets for targeted therapy. The field faces two core challenges. First, the complex regulatory relationships between genes are difficult to characterize fully with a single network. Second, the number of experimentally validated cancer driver genes is tiny relative to the size of the genome, which causes severe label scarcity and class imbalance. MNDGNN (Multiplex Networks-based Directed Graph Neural Network) was proposed to address both problems.

Section 03

Limitations of Traditional Methods

Most existing cancer driver gene identification methods rely on a single biological network (such as the Protein-Protein Interaction network, PPI) to model gene relationships. This simplified approach has obvious shortcomings:

  • Single Perspective Limitation: Gene regulation in biological systems is multi-level and multi-type. PPI only reflects physical interactions between proteins and cannot cover other important dimensions such as transcriptional regulation, signaling pathways, and kinase-substrate relationships.
  • Lack of Directionality: Many biological interactions have a clear direction (e.g., a kinase phosphorylating its substrate), and undirected graphs cannot express this asymmetry.
  • Label Scarcity Dilemma: Only a few hundred cancer driver genes have been experimentally validated, while the human genome contains roughly 20,000 protein-coding genes, so the ratio of positive to negative samples is extremely imbalanced.
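
The directionality point can be made concrete with a small sketch. The gene names and edges below are hypothetical toy data, not taken from any of the networks discussed here. The sketch only shows that a directed edge list keeps upstream and downstream neighbors apart, while an undirected view merges them.

```python
# Hypothetical toy edges: (source, target) means "source acts on target".
edges = [("KINASE_A", "SUBSTRATE_B"), ("TF_C", "KINASE_A")]


def in_out_neighbors(edge_list):
    """Return (upstream, downstream) neighbor sets for every node."""
    upstream, downstream = {}, {}
    for src, dst in edge_list:
        downstream.setdefault(src, set()).add(dst)
        upstream.setdefault(dst, set()).add(src)
        # Ensure every node has an entry in both maps, even if empty.
        downstream.setdefault(dst, set())
        upstream.setdefault(src, set())
    return upstream, downstream


def undirected_neighbors(edge_list):
    """Return neighbor sets after discarding edge direction."""
    neighbors = {}
    for src, dst in edge_list:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)
    return neighbors


upstream, downstream = in_out_neighbors(edges)
merged = undirected_neighbors(edges)

print(upstream["KINASE_A"])    # regulators of KINASE_A
print(downstream["KINASE_A"])  # targets of KINASE_A
print(merged["KINASE_A"])      # both roles mixed together
```

In the undirected view, KINASE_A's regulator and its substrate are indistinguishable, which is the information loss the second bullet describes.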

Section 04

Opportunities from Multi-omics Data

With the development of high-throughput sequencing technology, multi-omics data (genomics, transcriptomics, proteomics, etc.) and various kinds of biological network data have become increasingly abundant. This makes it possible to integrate information from multiple networks and build a more comprehensive model of gene relationships.

Section 05

Key Innovations

MNDGNN proposes three key innovations:

  1. Multiplex Network Integration: Simultaneously uses multiple network types such as PPI, protein complexes, KEGG pathways, RegNetwork, DawnNet, and kinase-substrate networks
  2. Directed Graph Convolution: Designs a dedicated directed graph convolution operation to capture neighbor diversity and degree diversity
  3. Data Augmentation Strategy: Combines positive sample augmentation and negative sample inference to alleviate the label scarcity problem

Section 06

Model Architecture

Input Layer:

  • Multi-omics feature vectors (gene expression, mutation, copy number variation, etc.)
  • Multiplex adjacency matrices (one matrix per network type)
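
As a rough illustration of these two inputs, the sketch below builds a per-gene feature matrix and one adjacency matrix per network over a shared gene index. The gene names, feature values, and edges are made up. Treating PPI as undirected and the kinase-substrate network as directed is an assumption of this sketch, not a detail confirmed by the article.

```python
# Hypothetical genes sharing one index across all networks.
genes = ["G1", "G2", "G3"]
index = {gene: i for i, gene in enumerate(genes)}

# One feature row per gene: [expression, mutation, copy-number] (made-up values).
features = [
    [0.8, 1.0, 0.1],
    [0.2, 0.0, -0.3],
    [0.5, 0.0, 0.9],
]

# Edge lists per network type, as (source, target) pairs.
network_edges = {
    "ppi": [("G1", "G2")],
    "kinase_substrate": [("G2", "G3")],
}
# Assumption of this sketch: which networks keep edge direction.
directed = {"ppi": False, "kinase_substrate": True}


def build_adjacency(edge_list, gene_index, is_directed):
    """Dense adjacency matrix with A[i][j] = 1.0 for an edge i -> j."""
    n = len(gene_index)
    matrix = [[0.0] * n for _ in range(n)]
    for src, dst in edge_list:
        i, j = gene_index[src], gene_index[dst]
        matrix[i][j] = 1.0
        if not is_directed:
            matrix[j][i] = 1.0  # mirror the edge for undirected networks
    return matrix


adjacency = {
    name: build_adjacency(edge_list, index, directed[name])
    for name, edge_list in network_edges.items()
}
```

Real networks with thousands of genes would normally use a sparse representation; dense lists are used here only to keep the example dependency-free.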

Directed Graph Convolution Layer:

Traditional Graph Convolutional Networks (GCNs) assume the graph is undirected and aggregate all neighbors in the same way, regardless of their role. MNDGNN's directed graph convolution accounts for:

  • Neighbor Diversity: Different types of neighbors (upstream regulators, downstream targets, interacting proteins) should be treated differently
  • Degree Diversity: The in-degree and out-degree of a node reflect its different roles in the network

In implementation, the model learns independent convolution kernels for each network type and aggregates representations from different networks through an attention mechanism.
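
The article does not give the exact formulas, so the following is only a minimal sketch of the general idea under stated assumptions. In-neighbors and out-neighbors are averaged separately (normalized by in-degree and out-degree), each direction gets its own weight matrix, and per-network outputs are fused with a softmax attention over a vector `q`. The function names, the mean aggregation, the ReLU, and the tanh scoring are all choices made for this illustration, not MNDGNN's published layer.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def row_normalize(a):
    """Divide each row by its sum, so rows with neighbors average over them."""
    return [[v / sum(row) if sum(row) else 0.0 for v in row] for row in a]


def directed_conv(adj, x, w_in, w_out, w_self):
    """One directed layer; adj[i][j] = 1 means an edge i -> j."""
    in_msg = matmul(matmul(row_normalize(transpose(adj)), x), w_in)  # from regulators
    out_msg = matmul(matmul(row_normalize(adj), x), w_out)           # from targets
    self_msg = matmul(x, w_self)
    return [
        [max(0.0, a + b + c) for a, b, c in zip(r_in, r_out, r_self)]  # ReLU
        for r_in, r_out, r_self in zip(in_msg, out_msg, self_msg)
    ]


def attention_fuse(per_network, q):
    """Fuse per-network node embeddings with node-wise softmax attention."""
    fused, weights = [], []
    for i in range(len(per_network[0])):
        scores = [sum(qd * math.tanh(h) for qd, h in zip(q, hk[i])) for hk in per_network]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        alpha = [e / sum(exps) for e in exps]
        weights.append(alpha)
        dim = len(per_network[0][i])
        fused.append([sum(a * hk[i][d] for a, hk in zip(alpha, per_network)) for d in range(dim)])
    return fused, weights


# Toy data: 3 genes, 2 features, two networks with separate kernels.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj_ppi = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
adj_ks = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
half = [[0.5, 0.0], [0.0, 0.5]]

h_ppi = directed_conv(adj_ppi, x, half, half, eye)
h_ks = directed_conv(adj_ks, x, eye, eye, eye)
fused, weights = attention_fuse([h_ppi, h_ks], q=[1.0, 1.0])
```

In a trained model the weight matrices and `q` would be learned parameters; fixed toy values are used here so the data flow is easy to follow by hand.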

Data Augmentation Module:

To address the label scarcity problem, MNDGNN adopts a two-pronged strategy:

  • Positive Sample Augmentation: The set of known cancer driver genes is expanded using neighbor similarity in the network
  • Negative Sample Inference: Uses anomaly detection (e.g., via the DeepOD library) to identify "high-confidence non-driver genes" among the large pool of unlabeled genes and treats them as negative samples
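
The negative-inference step can be illustrated with a deliberately simple stand-in. Instead of a deep anomaly detector, the sketch below scores each unlabeled gene by its Euclidean distance from the centroid of the known drivers and keeps the farthest ones as reliable negatives. This distance score is a swapped-in technique for illustration only; the function `select_reliable_negatives`, the `fraction` parameter, and all feature values are hypothetical and do not reflect how MNDGNN or DeepOD work internally.

```python
import math


def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def select_reliable_negatives(features, positives, unlabeled, fraction=0.5):
    """Pick the unlabeled genes least similar to the known drivers.

    Stand-in for an anomaly detector: genes far from the positive
    centroid are treated as high-confidence non-drivers.
    """
    center = centroid([features[g] for g in positives])
    ranked = sorted(
        unlabeled,
        key=lambda g: math.dist(features[g], center),
        reverse=True,  # farthest from the drivers first
    )
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


# Made-up 2-D features for two known drivers and four unlabeled genes.
features = {
    "D1": [1.0, 1.0],
    "D2": [1.2, 0.8],
    "U1": [1.1, 0.9],
    "U2": [5.0, 5.0],
    "U3": [-4.0, 0.0],
    "U4": [0.9, 1.1],
}
negatives = select_reliable_negatives(
    features, positives=["D1", "D2"], unlabeled=["U1", "U2", "U3", "U4"]
)
print(negatives)
```

Genes that sit close to the drivers (U1 and U4 here) stay unlabeled rather than being forced into the negative class, which is the point of inferring negatives instead of sampling them at random.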

Prediction Layer:

Uses a Multi-Layer Perceptron (MLP) to output the probability that each gene is a cancer driver gene, and uses class weights to handle class imbalance.
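
The article does not state which weighting scheme is used, so the sketch below assumes the common "balanced" heuristic, where each class weight is n_samples / (n_classes * class_count), applied inside a binary cross-entropy loss. The function names and the toy labels are hypothetical.

```python
import math


def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency (two classes)."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    return {1: n / (2 * n_pos), 0: n / (2 * n_neg)}


def weighted_bce(probs, labels, weights, eps=1e-12):
    """Mean binary cross-entropy with a per-class weight on each sample."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -weights[y] * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Toy labels: one driver among four genes.
labels = [1, 0, 0, 0]
weights = balanced_class_weights(labels)

# Missing the single driver vs. wrongly flagging one non-driver.
loss_missed_driver = weighted_bce([0.1, 0.1, 0.1, 0.1], labels, weights)
loss_false_alarm = weighted_bce([0.9, 0.9, 0.1, 0.1], labels, weights)
print(weights, loss_missed_driver, loss_false_alarm)
```

With these weights, missing the rare positive is penalized more heavily than a false alarm on a common negative, which counteracts the tendency of an unweighted loss to predict "non-driver" for everything.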

Section 07

Detailed Explanation of Network Types

MNDGNN integrates six types of biological networks:

  1. PPI Network: Physical interactions between proteins
  2. Protein Complex Network: Relationships between proteins that participate in the same complex
  3. KEGG Pathway Network: Gene relationships in metabolic and signaling pathways
  4. RegNetwork: Regulatory relationships between transcription factors and target genes
  5. DawnNet: Disease-related gene network
  6. Kinase-Substrate Network: Phosphorylation relationships from kinases to their substrates

These networks characterize functional associations between genes from different perspectives. Integrated together, they reflect the potential role of each gene in cancer development more comprehensively than any single network.

Section 08

Dataset

The study used the following data resources:

  • Multi-omics Data: Gene expression, mutation, and copy number variation data from projects such as TCGA
  • Validated Driver Genes: From authoritative databases such as the Cancer Gene Census
  • Candidate Gene Set: Genes with a possible cancer association that have passed preliminary screening