Zing Forum

Drug-Target Interaction Prediction: A Comparative Study of Molecular Fingerprints and Graph Neural Networks

This project predicts EGFR inhibitor activity from ChEMBL data. It compares two methods, Morgan molecular fingerprints with Random Forest and Graph Neural Networks (GNNs), and implements a complete machine learning workflow using RDKit, PyTorch Geometric, and SHAP.

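Tags: Drug discovery · Drug-target interaction · Molecular fingerprints · Graph neural networks · EGFR · ChEMBL · RDKit · Machine learning · Bioinformatics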
Published 2026-05-27 12:43 · Recent activity 2026-05-27 12:52 · Estimated read 5 min

Section 01

Introduction: Core Summary of Comparative Study on Drug-Target Interaction Prediction Methods

This study focuses on predicting the activity of epidermal growth factor receptor (EGFR) inhibitors. Using data from the ChEMBL database, it compares two approaches: Morgan molecular fingerprints with a Random Forest classifier, and Graph Neural Networks (GNNs). The complete machine learning workflow is implemented with RDKit, PyTorch Geometric, and SHAP. At the current data scale, the fingerprint-based Random Forest (ROC-AUC 0.9694) outperforms the GNN (ROC-AUC 0.8887).

Section 02

Research Background: Importance of DTI Prediction and EGFR Target

Drug-Target Interaction (DTI) prediction is a core problem in drug discovery: reliable predictions can accelerate new drug development and reduce experimental costs. This project targets EGFR inhibitor activity prediction. EGFR is a key target in cancer treatment, so developing effective EGFR inhibitors is of great significance for tumor therapy.

Section 03

Dataset Construction: Acquisition and Cleaning of ChEMBL Data

Data is sourced from the ChEMBL database (target CHEMBL203), with 17,723 raw records. After cleaning, 8,728 compounds remain (89% active, 11% inactive), with a mean pIC50 of 7.23 and a mean molecular weight of 488.1 Da. The cleaning pipeline is: retrieval via the ChEMBL API → IC50 extraction → pIC50 conversion → binary label generation.
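
The conversion and labeling steps can be sketched as follows, assuming IC50 values are expressed in nM (the usual ChEMBL standard unit), so that pIC50 = 9 - log10(IC50 in nM). The write-up does not state the activity cutoff, so the `threshold` of 6.0 (an IC50 of 1 µM) is purely an assumption for illustration, as are the function names.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


def label_active(pic50: float, threshold: float = 6.0) -> int:
    """Return 1 (active) if pIC50 >= threshold, else 0 (inactive).

    The 6.0 default is an assumed cutoff, not taken from the project.
    """
    return int(pic50 >= threshold)


if __name__ == "__main__":
    # Illustrative IC50 values in nM, not real ChEMBL records.
    for ic50 in (10.0, 250.0, 50000.0):
        p = ic50_nm_to_pic50(ic50)
        print(f"IC50 = {ic50:>8.1f} nM  pIC50 = {p:.2f}  label = {label_active(p)}")
```

Whatever cutoff the project actually uses, it directly determines the 89/11 class split, so it is worth recording alongside the processed data.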

Section 04

Model Method Comparison: Differences Between Traditional Machine Learning and GNNs

Method 1: Morgan Fingerprints + Random Forest

  • Molecular representation: Morgan fingerprints generated by RDKit (radius 2, 2048-bit binary vector)
  • Model: Random Forest (200 trees, class_weight='balanced', 5-fold cross-validation)
  • Interpretability: SHAP analysis for feature importance
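
As a sketch of what `class_weight='balanced'` does: scikit-learn documents the heuristic as n_samples / (n_classes * per-class count), so the minority class receives a proportionally larger weight. The pure-Python example below reproduces that formula on a toy label list with the same 89/11 split reported for the cleaned dataset. The 100-sample size and the helper name `balanced_class_weights` are illustrative assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Per-class weights using n_samples / (n_classes * class_count),
    the formula scikit-learn documents for class_weight='balanced'."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


if __name__ == "__main__":
    # Toy labels mirroring the 89% active / 11% inactive split (assumed size: 100).
    toy_labels = [1] * 89 + [0] * 11
    for cls, weight in sorted(balanced_class_weights(toy_labels).items()):
        print(f"class {cls}: weight {weight:.3f}")
```

With this split, each inactive sample is weighted roughly eight times as heavily as an active one (about 4.55 versus 0.56), which is how the Random Forest compensates for the imbalance.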

Method 2: Graph Neural Network

  • Molecular representation: Atoms as nodes (15-dimensional features), chemical bonds as edges
  • Model: 3-layer GCN, global average pooling, Adam optimizer, trained for 50 epochs
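
To make the GCN description concrete, here is a minimal NumPy sketch of a single graph-convolution layer followed by global average pooling, using the standard Kipf and Welling propagation rule H' = ReLU(D^-1/2 (A + I) D^-1/2 H W). It is not the project's PyTorch Geometric model: the 3-atom toy graph, random features and weights, hidden size of 8, and function names are all assumptions. Only the 15-dimensional atom features come from the description above.

```python
import numpy as np


def gcn_layer(adj, feats, weight):
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])           # add self-loops
    deg = a_hat.sum(axis=1)                      # node degrees including self-loop
    d_inv_sqrt = np.diag(deg ** -0.5)
    norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt   # symmetric normalization
    return np.maximum(norm_adj @ feats @ weight, 0.0)


def global_mean_pool(node_feats):
    """Average node embeddings into a single graph-level vector."""
    return node_feats.mean(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy 3-atom chain with bonds 0-1 and 1-2 (undirected adjacency matrix).
    adj = np.array([[0, 1, 0],
                    [1, 0, 1],
                    [0, 1, 0]], dtype=float)
    x = rng.normal(size=(3, 15))   # 15-dimensional atom features (random placeholders)
    w = rng.normal(size=(15, 8))   # hidden size 8 is an arbitrary choice
    h = gcn_layer(adj, x, w)
    graph_vec = global_mean_pool(h)
    print(h.shape, graph_vec.shape)
```

The model described above stacks three such layers before pooling, and the pooled vector feeds a classification head. Trained weights replace the random `w` used here.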

Section 05

Experimental Results: Analysis of Why Random Forest Outperforms GNN

Results show that the Random Forest reaches an ROC-AUC of 0.9694, versus 0.8887 for the GNN. Likely reasons include:

  1. Class imbalance: the RF uses class_weight to mitigate it, while the GNN does not.
  2. Data scale: 8,728 samples is a small dataset for a GNN.
  3. Incomplete training: the GNN had not converged (its AUC was still rising at epoch 50).
  4. Representation: Morgan fingerprints tend to be more effective on small to medium datasets.
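
Since the comparison rests on ROC-AUC, a small sketch of what the metric measures may help: it equals the probability that a randomly chosen active compound receives a higher score than a randomly chosen inactive one, with ties counting half. The pure-Python function below computes it by direct pairwise comparison. The function name and toy scores are illustrative and unrelated to the project's predictions; in practice a library routine such as scikit-learn's `roc_auc_score` would be used.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute 0.5. Runs in O(n_pos * n_neg), fine for small examples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy labels and scores, chosen only to exercise the function.
    y_true = [1, 1, 1, 0, 1, 0]
    y_score = [0.9, 0.8, 0.4, 0.5, 0.7, 0.2]
    print(f"ROC-AUC = {roc_auc(y_true, y_score):.3f}")
```

Because this is a ranking metric, it is not inflated by the class split the way raw accuracy is: a model that scores every compound identically would reach 89% accuracy on this dataset by calling everything active, but only 0.5 ROC-AUC.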

Section 06

Tech Stack and Project Structure

Tools: RDKit (molecular processing), scikit-learn (RF implementation), PyTorch/PyTorch Geometric (GNN), ChEMBL API (data acquisition), SHAP (interpretability), etc.

Project structure: directories include data (raw/processed), notebooks (six phase scripts), src (helper functions), models (model files), etc.

Section 07

Implications for Drug Discovery Research

  1. Method selection should consider data scale: traditional methods are better suited to small and medium datasets.
  2. Class imbalance needs to be addressed: it is a common issue in biological activity data.
  3. Interpretability is important: SHAP analysis can guide compound design.
  4. A complete workflow has value: it supports reproducible research.

Section 08

Research Summary

This project provides a complete workflow for EGFR inhibitor activity prediction. Comparing the two methods shows that, although GNNs are in theory well suited to learning complex structures, Morgan fingerprints with Random Forest perform better on the current dataset. This suggests that researchers should choose methods based on the problem and data at hand rather than blindly pursuing newer ones.