Fusing Expert Knowledge with Graph Neural Networks: An Exploration of Collaborative Learning for Molecular Water Solubility Prediction

An AI4Science study that explores the synergistic effect between traditional chemical descriptors and graph neural networks (GNNs) in molecular water solubility prediction by comparing Random Forest, XGBoost, MLP, GNN, and hybrid GNN models.

Tags: AI4Science, molecular water solubility prediction, graph neural networks, expert descriptors, cheminformatics, RDKit, PyTorch Geometric, feature fusion
Published 2026-05-02 10:13 | Recent activity 2026-05-02 10:23 | Estimated read 5 min

Section 01

[Introduction] Fusing Expert Knowledge with GNNs: An Exploration of Collaborative Learning for Molecular Water Solubility Prediction

This AI4Science study explores whether traditional chemical descriptors and graph neural networks (GNNs) complement each other in molecular water solubility prediction. Comparing Random Forest, XGBoost, MLP, GNN, and hybrid GNN models, it finds that the hybrid architecture, which fuses expert knowledge with a GNN, maintains stable performance across the entire solubility range. This illustrates the value of combining domain knowledge with data-driven methods.

Section 02

Research Background and Core Questions

In drug discovery and materials science, molecular water solubility is a key indicator for evaluating the druggability of compounds. Traditional prediction relies on expert-designed physicochemical descriptors, while GNNs show significant potential in molecular representation learning. Core question: can traditional chemical knowledge and GNN-based automatic representation learning produce a synergistic effect, so that combining the two outperforms either method alone?

Section 03

Dataset and Feature Engineering

The classic ESOL (Delaney) dataset was used. The study states that three types of features were constructed via RDKit; two of them are described here:

  1. Graph features (per atom): atomic number, degree, aromaticity flag, hybridization type (min-max normalized);
  2. Expert descriptors (per molecule): MolLogP (octanol-water partition coefficient, a measure of lipophilicity), TPSA (topological polar surface area), molecular weight, number of valence electrons (min-max normalized).
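
The two feature sets above could be extracted roughly as follows. This is a minimal sketch rather than the study's own code: the helper names (`atom_features`, `expert_descriptors`, `min_max_normalize`) and the three toy SMILES strings are assumptions made for illustration, and casting the hybridization enum to an integer is only one possible encoding.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def atom_features(mol):
    """Per-atom graph features: atomic number, degree, aromaticity, hybridization."""
    return [
        [
            atom.GetAtomicNum(),
            atom.GetDegree(),
            int(atom.GetIsAromatic()),
            int(atom.GetHybridization()),  # RDKit enum cast to its integer code (assumed encoding)
        ]
        for atom in mol.GetAtoms()
    ]


def expert_descriptors(mol):
    """Per-molecule descriptors: MolLogP, TPSA, molecular weight, valence electrons."""
    return [
        Descriptors.MolLogP(mol),
        Descriptors.TPSA(mol),
        Descriptors.MolWt(mol),
        Descriptors.NumValenceElectrons(mol),
    ]


def min_max_normalize(rows):
    """Scale each column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    return [
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


# Toy inputs (hypothetical, not taken from ESOL): ethanol, benzene, acetic acid.
smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
descriptor_table = min_max_normalize([expert_descriptors(m) for m in mols])
```

In a real pipeline the minimum and maximum would normally be computed on the training split only and then reused for validation and test molecules, so that no information leaks across splits.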

Section 04

Model Architecture and Comparative Experiments

Five models were designed for comparison:

  1. Baseline models: Random Forest and XGBoost (trained on the one-dimensional expert descriptor vector);
  2. MLP: fully connected network (learns nonlinear combinations of descriptors, with no structural information);
  3. GNN: Graph Convolutional Network (GCN; captures molecular topology and atomic interactions);
  4. Hybrid GNN: fuses GCN graph embeddings with the physicochemical descriptors, combining structural awareness with global molecular properties.
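
To make the hybrid idea concrete, here is a minimal sketch of one way to fuse a graph embedding with a descriptor vector. Several things are assumptions, since the summary does not specify them: fusion by concatenation, mean pooling over atoms, two graph-convolution layers, and a hidden size of 32. The study used PyTorch Geometric; this sketch instead writes a small dense GCN layer in plain PyTorch so that it stays self-contained, and it processes one molecule at a time.

```python
import torch
import torch.nn as nn


def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 with self-loops added."""
    a_hat = adj + torch.eye(adj.size(0))
    deg_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
    return deg_inv_sqrt.unsqueeze(1) * a_hat * deg_inv_sqrt.unsqueeze(0)


class DenseGCNLayer(nn.Module):
    """One graph-convolution step: H' = ReLU(A_norm @ (H W + b))."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj_norm):
        return torch.relu(adj_norm @ self.linear(x))


class HybridGNN(nn.Module):
    """GCN graph embedding concatenated with expert descriptors (assumed fusion scheme)."""

    def __init__(self, atom_dim=4, desc_dim=4, hidden=32):
        super().__init__()
        self.gcn1 = DenseGCNLayer(atom_dim, hidden)
        self.gcn2 = DenseGCNLayer(hidden, hidden)
        self.head = nn.Sequential(
            nn.Linear(hidden + desc_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # single regression output: predicted log S
        )

    def forward(self, x, adj, desc):
        adj_norm = normalize_adjacency(adj)
        h = self.gcn2(self.gcn1(x, adj_norm), adj_norm)
        graph_emb = h.mean(dim=0)              # mean-pool atoms into one molecule vector
        fused = torch.cat([graph_emb, desc])   # fuse structure with descriptors
        return self.head(fused)


# Toy molecule with 3 atoms in a chain (random features, for shape checking only).
x = torch.rand(3, 4)
adj = torch.tensor([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
desc = torch.rand(4)
prediction = HybridGNN()(x, adj, desc)
```

Dropping the `desc` input and the extra `desc_dim` in the head gives the pure GNN variant, and feeding only `desc` to the head gives an MLP on descriptors, which is one way to keep the three neural models comparable.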

Section 05

Key Findings: Evidence of Synergistic Effects

  • Limitations of pure GNNs: High error for low-solubility molecules (log S < 0) due to the lack of global hydrophobicity features;
  • Limitations of expert models: Large error for high-solubility molecules (log S > 0), as descriptors lack structural granularity;
  • Advantages of hybrid GNNs: Expert descriptors provide physical baselines, GNNs capture structural details, resulting in the best robustness across the entire range.
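
The per-range comparison behind these findings can be sketched as an RMSE computed separately on each side of log S = 0. The function names and the toy numbers below are made up for illustration and are not results from the study. Molecules with log S exactly 0 are placed in the high-solubility group here, which is an arbitrary choice.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def rmse_by_region(y_true, y_pred, threshold=0.0):
    """RMSE overall and separately for log S below / at-or-above the threshold."""
    pairs = list(zip(y_true, y_pred))
    regions = {
        "low": [(t, p) for t, p in pairs if t < threshold],
        "high": [(t, p) for t, p in pairs if t >= threshold],
    }
    report = {"overall": rmse(y_true, y_pred)}
    for name, subset in regions.items():
        # An empty region has no defined error, so report NaN instead of failing.
        report[name] = rmse(*zip(*subset)) if subset else float("nan")
    return report


# Made-up values: this hypothetical model is off by 1.0 on one low-solubility molecule.
y_true = [-4.0, -2.0, 0.5, 1.0]
y_pred = [-3.0, -2.0, 0.5, 1.0]
report = rmse_by_region(y_true, y_pred)
```

Running the same function on the predictions of each model gives a small table of overall, low-range, and high-range errors, which is where a model that looks fine on the aggregate metric can turn out to be weak in one region.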

Section 06

Tech Stack and Implementation Details

Toolchain used:

  • Cheminformatics: RDKit (molecular feature extraction);
  • Deep learning: PyTorch, PyTorch Geometric (GNN implementation);
  • Machine learning: Scikit-learn, XGBoost (traditional models);
  • Data processing and visualization: Pandas, NumPy (data handling), Matplotlib (visualization).

Section 07

Implications and Outlook

  1. Fusing domain knowledge with data-driven approaches is more effective than using either alone; expert features act as physical constraints;
  2. Multimodal feature fusion is a promising future direction;
  3. Performance analysis by solubility range is more informative than a single aggregate metric.

Recommendation: Practitioners should combine traditional domain knowledge with machine learning techniques to enhance scientific discovery capabilities.