# Drug-Target Interaction Prediction: A Comparative Study of Molecular Fingerprints and Graph Neural Networks

> This project predicts EGFR inhibitor activity from ChEMBL data and compares two methods: Morgan molecular fingerprints with a Random Forest, and a Graph Neural Network (GNN). It implements a complete machine learning workflow using RDKit, PyTorch Geometric, and SHAP.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-27T04:43:11.000Z
- Last activity: 2026-05-27T04:52:08.798Z
- Popularity: 161.8
- Keywords: drug discovery, drug-target interaction, molecular fingerprints, graph neural networks, EGFR, ChEMBL, RDKit, machine learning, bioinformatics
- Page link: https://www.zingnex.cn/en/forum/thread/geo-github-vishnuprabhauvaraj-dti-prediction
- Canonical: https://www.zingnex.cn/forum/thread/geo-github-vishnuprabhauvaraj-dti-prediction
- Markdown source: floors_fallback

---

## Introduction: Summary of the Comparative Study on Drug-Target Interaction Prediction Methods

This study predicts the activity of Epidermal Growth Factor Receptor (EGFR) inhibitors using data from the ChEMBL database. It compares two methods: Morgan molecular fingerprints with a Random Forest, and a Graph Neural Network (GNN). The complete machine learning workflow is built with RDKit, PyTorch Geometric, and SHAP. At the current data scale, the fingerprint-based Random Forest outperforms the GNN (ROC-AUC 0.9694 vs. 0.8887).

## Research Background: Importance of DTI Prediction and EGFR Target

Drug-Target Interaction (DTI) prediction is a core problem in drug discovery: accurate predictions can accelerate drug development and reduce experimental costs. This project targets EGFR inhibitor activity. EGFR is a key target in cancer treatment, so better inhibitor prediction bears directly on tumor therapy.

## Dataset Construction: Acquisition and Cleaning of ChEMBL Data

The data come from ChEMBL target CHEMBL203 (EGFR): 17,723 raw records, reduced to 8,728 compounds after cleaning (89% active, 11% inactive). The cleaned set has a mean pIC50 of 7.23 and a mean molecular weight of 488.1 Da. The cleaning pipeline has four steps: retrieval through the ChEMBL API → IC50 extraction → conversion to pIC50 (the negative base-10 logarithm of the IC50 in mol/L) → binary label generation.
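The pIC50 conversion and labeling steps can be sketched as follows. This is a minimal illustration, not the project's code: it assumes IC50 values are reported in nanomolar, and the cutoff `ACTIVITY_THRESHOLD = 6.0` (equivalent to 1 µM) is a hypothetical choice, since the post does not state which threshold the project used. The compound names and IC50 values are made up for the example.

```python
import math

# Hypothetical cutoff: the source does not state the threshold it used.
ACTIVITY_THRESHOLD = 6.0


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 mol/L, so -log10(x * 1e-9) = 9 - log10(x)
    return 9.0 - math.log10(ic50_nm)


def label_activity(pic50: float, threshold: float = ACTIVITY_THRESHOLD) -> int:
    """Return 1 (active) if pIC50 meets the threshold, else 0 (inactive)."""
    return 1 if pic50 >= threshold else 0


# Illustrative records only: (compound name, IC50 in nM)
records = [("cmpd_a", 10.0), ("cmpd_b", 250.0), ("cmpd_c", 50000.0)]

labeled = []
for name, ic50_nm in records:
    pic50 = ic50_nm_to_pic50(ic50_nm)
    labeled.append((name, pic50, label_activity(pic50)))

for name, pic50, label in labeled:
    print(f"{name}: pIC50={pic50:.2f} label={label}")
```

Because the label depends entirely on the cutoff, the reported 89/11 class split would shift if a different threshold were chosen.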

## Model Method Comparison: Differences Between Traditional Machine Learning and GNNs

**Method 1: Morgan Fingerprints + Random Forest**
- Molecular representation: Morgan fingerprints generated by RDKit (radius 2, 2048-bit binary vector)
- Model: Random Forest (200 trees, class_weight='balanced', 5-fold cross-validation)
- Interpretability: SHAP analysis for feature importance
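To make the `class_weight='balanced'` setting concrete, the sketch below reproduces the weighting formula documented by scikit-learn, `n_samples / (n_classes * count_of_class)`, in plain Python. The 89/11 label list is a toy stand-in that mirrors the reported class ratio, not the actual dataset, and `balanced_class_weights` is a helper name introduced here for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Compute per-class weights as n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels mirroring the reported 89% active / 11% inactive split
labels = [1] * 89 + [0] * 11
weights = balanced_class_weights(labels)

for cls in sorted(weights):
    print(f"class {cls}: weight {weights[cls]:.3f}")
```

With these weights, each class contributes the same total weight to training, so the minority inactive class is not drowned out by the active majority.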

**Method 2: Graph Neural Network**
- Molecular representation: Atoms as nodes (15-dimensional features), chemical bonds as edges
- Model: 3-layer GCN, global average pooling, Adam optimizer, trained for 50 epochs
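The forward pass of such a model can be sketched in NumPy. This is an untrained illustration of the architecture described above (three graph convolution layers, global average pooling, a sigmoid output), not the project's PyTorch Geometric code. The hidden size of 32, the random weights, the four-atom chain graph, and the function names are all assumptions made for the example. The layer uses the symmetric normalization with self-loops from the standard GCN formulation.

```python
import numpy as np


def normalized_adjacency(edges, num_nodes):
    """Build D^{-1/2} (A + I) D^{-1/2} for an undirected graph."""
    a = np.eye(num_nodes)  # self-loops
    for i, j in edges:
        a[i, j] = 1.0
        a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_forward(x, a_hat, layer_weights, readout):
    """Apply stacked GCN layers, mean-pool over atoms, return a probability."""
    h = x
    for w in layer_weights:
        h = np.maximum(a_hat @ h @ w, 0.0)  # propagate, transform, ReLU
    graph_embedding = h.mean(axis=0)        # global average pooling
    logit = graph_embedding @ readout
    return 1.0 / (1.0 + np.exp(-logit))     # sigmoid


rng = np.random.default_rng(0)
num_atoms, in_dim, hidden_dim = 4, 15, 32   # hidden_dim is an assumption
edges = [(0, 1), (1, 2), (2, 3)]            # toy chain of four atoms

x = rng.normal(size=(num_atoms, in_dim))    # stand-in for 15-dim atom features
a_hat = normalized_adjacency(edges, num_atoms)
layer_weights = [
    rng.normal(scale=0.1, size=(in_dim, hidden_dim)),
    rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),
    rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),
]
readout = rng.normal(scale=0.1, size=hidden_dim)

prob = gcn_forward(x, a_hat, layer_weights, readout)
print(f"predicted activity probability: {prob:.3f}")
```

Average pooling makes the output independent of atom ordering, which is why the same molecule gives the same prediction however its atoms are numbered.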

## Experimental Results: Analysis of Why Random Forest Outperforms GNN

The Random Forest reaches an ROC-AUC of 0.9694, compared with 0.8887 for the GNN. The project attributes the gap to four factors:

1. Class imbalance: the Random Forest mitigates it with class_weight, while the GNN applies no correction.
2. Data scale: 8,728 samples is a small training set for a GNN.
3. Incomplete training: the GNN had not converged, since its AUC was still rising at epoch 50.
4. Representation: Morgan fingerprints tend to be more effective on small to medium datasets.
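For readers unfamiliar with the metric, ROC-AUC equals the probability that a randomly chosen active compound is scored above a randomly chosen inactive one, with ties counted as one half. The sketch below computes it directly from that definition. The labels and scores are invented for illustration, and `roc_auc` is a helper written for this example rather than the evaluation code used in the project.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (active, inactive) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Invented example: four actives, two inactives
labels = [1, 1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(roc_auc(labels, scores))       # 7 of 8 pairs ranked correctly -> 0.875
print(roc_auc(labels, [0.5] * 6))    # constant predictor -> 0.5
```

A model that gives every compound the same score gets 0.5 regardless of the class ratio, which makes ROC-AUC more informative than accuracy on an 89/11 split, where always predicting "active" would already be 89% accurate.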

## Tech Stack and Project Structure

**Tools**: RDKit (molecular processing), scikit-learn (Random Forest implementation), PyTorch and PyTorch Geometric (GNN), the ChEMBL API (data acquisition), SHAP (interpretability), and others.

**Project Structure**: directories include data (raw and processed), notebooks (six phase scripts), src (helper functions), and models (saved model files), among others.

## Implications for Drug Discovery Research

1. Match the method to the data scale: traditional methods tend to do better on small to medium datasets.
2. Address class imbalance: it is a common issue in biological activity data.
3. Value interpretability: SHAP analysis can guide compound design.
4. Build a complete workflow: it supports reproducible research.

## Research Summary

This project provides a complete workflow for EGFR inhibitor activity prediction. Although GNNs are in theory well suited to learning from complex molecular structure, Morgan fingerprints with a Random Forest perform better on the current dataset. The practical lesson is to choose methods according to the problem and the available data rather than defaulting to the newest technique.