Zing Forum

CaPFAS: An Interpretable Multimodal Neural Network-Based Comprehensive Analysis Framework for Per- and Polyfluoroalkyl Substances (PFAS)

CaPFAS is an open-source framework designed specifically for PFAS (per- and polyfluoroalkyl substances) analysis. It integrates data cleaning, preprocessing, and model training functions, adopts an interpretable multimodal neural network architecture, and provides an end-to-end machine learning solution for environmental science and toxicology research.

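Tags: PFAS, multimodal neural network, interpretable AI, environmental chemistry, toxicology, machine learning, data cleaning, graph neural network, molecular prediction, environmental risk assessment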
Published 2026-06-16 15:44 | Recent activity 2026-06-16 15:54 | Estimated read 8 min
Section 01

Introduction to the CaPFAS Framework: An Interpretable Multimodal Neural Network Solution for PFAS Analysis

CaPFAS is an open-source framework developed by the Fu Research Group at the State Key Laboratory of Environmental Chemistry and Ecotoxicology, Henan Academy of Sciences (HIAS-RCEES-FuLab), designed specifically for the analysis of PFAS (per- and polyfluoroalkyl substances). The framework integrates data cleaning, preprocessing, and model training, adopts an interpretable multimodal neural network architecture, and provides an end-to-end machine learning solution for environmental science and toxicology research. The source code is hosted on GitHub, and the project was released on June 16, 2026.

Section 02

Background: Complex Challenges in PFAS Analysis

PFAS are known as 'forever chemicals' because of their environmental persistence, bioaccumulation, and potential toxicity. Thousands of variants have been detected globally, posing significant challenges to environmental monitoring and risk assessment. Traditional analysis methods face three major problems:

  1. Multi-source data differ widely in format and quality;
  2. Toxicity mechanisms are highly complex, involving multiple targets and pathways;
  3. Existing machine learning models lack interpretability, which makes it difficult to meet the transparency requirements of scientific research and regulation.

The field needs an integrated framework that can consolidate multi-source data and provide interpretable predictions.

Section 03

Core Features and Design Philosophy of the CaPFAS Framework

The core goal of CaPFAS is to provide an end-to-end solution for PFAS data analysis and prediction. Its design emphasizes three key features:

  1. Multimodal Data Fusion: Uniformly process structured data (physicochemical properties, concentration values) and unstructured data (molecular structures, mass spectra);
  2. Interpretability First: Reveal the key features and mechanisms behind predictions through interpretable architectures;
  3. End-to-End Automation: Cover the complete workflow from data cleaning and preprocessing to model training, lowering the barrier to use.

Section 04

Analysis of the Core Technical Architecture of CaPFAS

Data Cleaning and Preprocessing Module

A dedicated built-in pipeline handles missing values, outliers, and inconsistent units. It also supports feature engineering such as molecular descriptor calculation and standardization of physicochemical properties.
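The three cleaning tasks named above can be illustrated with a minimal, standard-library sketch. It is not CaPFAS code: the function names, the conversion table, the median imputation, and the modified z-score outlier rule are all assumptions chosen for illustration, and the sample values are made up.

```python
import math
import statistics

# Hypothetical conversion factors to a common unit (ng/L); not taken from CaPFAS.
TO_NG_PER_L = {"ng/L": 1.0, "ug/L": 1e3, "mg/L": 1e6}


def to_ng_per_l(value, unit):
    """Convert a concentration to ng/L; return None when it cannot be parsed."""
    if unit not in TO_NG_PER_L:
        return None
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return None if math.isnan(number) else number * TO_NG_PER_L[unit]


def clean_concentrations(records, z_cutoff=3.5):
    """Harmonise units, impute missing values with the median, flag outliers.

    Assumes at least one record has a usable value.
    """
    converted = [to_ng_per_l(r.get("value"), r.get("unit")) for r in records]
    observed = [v for v in converted if v is not None]
    median = statistics.median(observed)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = statistics.median([abs(v - median) for v in observed])

    cleaned = []
    for record, value in zip(records, converted):
        imputed = value is None
        if imputed:
            value = median
        # Modified z-score; 0.6745 rescales the MAD to a normal standard deviation.
        z = 0.6745 * (value - median) / mad if mad > 0 else 0.0
        cleaned.append({
            "name": record["name"],
            "ng_per_l": value,
            "imputed": imputed,
            "outlier": abs(z) > z_cutoff,
        })
    return cleaned


# Made-up measurements with mixed units, a missing value, and one extreme value.
records = [
    {"name": "sample_1", "value": 12.0, "unit": "ng/L"},
    {"name": "sample_2", "value": 0.015, "unit": "ug/L"},
    {"name": "sample_3", "value": None, "unit": "ng/L"},
    {"name": "sample_4", "value": 9.0, "unit": "ng/L"},
    {"name": "sample_5", "value": 0.5, "unit": "mg/L"},
    {"name": "sample_6", "value": "11", "unit": "ng/L"},
]
cleaned = clean_concentrations(records)
for row in cleaned:
    print(row)
```

Flagging outliers rather than dropping them keeps the decision with the analyst, which matters when an extreme concentration may be a real contamination event rather than a measurement error.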

Multimodal Neural Network

  • Molecular Structure Modality: Encode topological structures using Graph Neural Networks (GNN) or molecular fingerprints;
  • Physicochemical Property Modality: Process numerical features like molecular weight and LogP;
  • Text Description Modality: Analyze literature information using NLP;
  • Fusion Layer: Weight the contribution of each modality with an attention mechanism; the resulting weights also help explain which modality drove a prediction.
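The attention-based fusion described in the list above can be sketched in plain Python. The source does not specify which form of attention CaPFAS uses, so this example assumes scaled dot-product attention with a fixed query vector standing in for a learned parameter; the function names and embedding values are hypothetical, and a real implementation would use PyTorch tensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(embeddings, query):
    """Fuse per-modality embeddings into one vector.

    embeddings: dict mapping modality name -> vector (all the same length)
    query:      vector of that length (learned in a real model, fixed here)
    Returns the fused vector and the attention weight of each modality.
    """
    names = list(embeddings)
    dim = len(query)
    # Scaled dot-product score between the query and each modality embedding.
    scores = [
        sum(q * x for q, x in zip(query, embeddings[name])) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # Weighted sum of the modality embeddings, computed per dimension.
    fused = [
        sum(w * embeddings[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Made-up embeddings for the three modalities listed above.
embeddings = {
    "structure": [1.0, 0.0, 1.0, 0.0],
    "properties": [0.0, 1.0, 0.0, 1.0],
    "text": [0.5, 0.5, 0.5, 0.5],
}
query = [2.0, 0.0, 2.0, 0.0]

fused, weights = attention_fuse(embeddings, query)
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

Because the weights are non-negative and sum to one, they can be read directly as the relative contribution of each modality to a given prediction, which is what makes this kind of fusion layer convenient for explanation.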

Interpretability Mechanisms

The framework includes feature importance analysis, attention visualization, and counterfactual explanations. It is also compatible with the SHAP framework, which provides game-theoretic feature attribution.
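To make "game-theoretic feature attribution" concrete, the sketch below computes exact Shapley values for a toy scoring function. SHAP approximates these same quantities for real models. Nothing here calls the SHAP library or CaPFAS: the feature names and the toy score are invented, and exact enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial, isclose


def shapley_values(features, value_fn):
    """Exact Shapley value of each feature for a coalition value function.

    value_fn takes a frozenset of 'present' features and returns a score.
    """
    n = len(features)
    result = {}
    for feature in features:
        others = [f for f in features if f != feature]
        total = 0.0
        for size in range(n):
            # Probability that exactly this coalition precedes the feature
            # in a uniformly random ordering of all features.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                gain = value_fn(coalition | {feature}) - value_fn(coalition)
                total += weight * gain
        result[feature] = total
    return result


def toy_score(present):
    """Invented toxicity-like score with one interaction term."""
    score = 0.0
    if "chain_length" in present:
        score += 2.0
    if "sulfonate_group" in present:
        score += 1.0
    if "chain_length" in present and "sulfonate_group" in present:
        score += 1.0  # interaction, shared equally between the two features
    if "logP" in present:
        score += 0.5
    return score


features = ["chain_length", "sulfonate_group", "logP"]
phi = shapley_values(features, toy_score)
for name, value in phi.items():
    print(f"{name}: {value:.2f}")

# Efficiency property: attributions add up to the full score minus the baseline.
full_gain = toy_score(frozenset(features)) - toy_score(frozenset())
print(isclose(sum(phi.values()), full_gain))
```

The interaction term is split evenly between the two features that produce it, which is the behavior that distinguishes Shapley-style attribution from simply reading off per-feature coefficients.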

Section 05

Application Scenarios and Practical Value of CaPFAS

  1. Toxicity Prediction and Risk Assessment: Build models to predict acute/chronic toxicity of PFAS; interpretable outputs identify key toxic structures, guiding the development of safer-by-design alternatives;
  2. Environmental Fate Simulation: Predict parameters such as bioconcentration factors and soil adsorption coefficients, supplementing experimental data to support exposure assessment;
  3. High-Throughput Screening and Prioritization: Quickly identify high-risk compounds, providing support for regulatory prioritization and resource allocation.

Section 06

Technical Implementation and User Guide for CaPFAS

CaPFAS is implemented in Python and relies on the PyTorch deep learning framework and the RDKit cheminformatics toolkit. Users define tasks (data paths, hyperparameters, etc.) through configuration files, and the framework supports two usage modes: a command-line interface and a Python API. Its modular design makes it easy to extend: users can insert custom preprocessing steps, replace network architectures, or integrate new interpretation methods.
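The source does not show the configuration schema or the extension API, so the following is only a sketch of how a config-driven, pluggable preprocessing pipeline of this kind is commonly structured. Every name in it (`TaskConfig`, `register_step`, the JSON keys, the step names, the data path) is hypothetical and should not be read as the actual CaPFAS interface.

```python
import json
from dataclasses import dataclass, field


@dataclass
class TaskConfig:
    """Hypothetical task definition: data path, hyperparameters, step names."""
    data_path: str
    learning_rate: float = 1e-3
    epochs: int = 10
    preprocessing: list = field(default_factory=list)


# Registry mapping step names (as written in the config) to functions.
STEP_REGISTRY = {}


def register_step(name):
    """Decorator that makes a preprocessing function selectable by name."""
    def decorator(fn):
        STEP_REGISTRY[name] = fn
        return fn
    return decorator


@register_step("drop_missing")
def drop_missing(rows):
    """Remove rows whose logP value is missing."""
    return [r for r in rows if r.get("logP") is not None]


@register_step("round_logp")
def round_logp(rows):
    """Custom user-defined step: round logP to one decimal place."""
    return [{**r, "logP": round(r["logP"], 1)} for r in rows]


def load_config(text):
    """Parse a JSON config string into a TaskConfig."""
    return TaskConfig(**json.loads(text))


def run_preprocessing(config, rows):
    """Apply the configured steps in order."""
    for name in config.preprocessing:
        rows = STEP_REGISTRY[name](rows)
    return rows


# Illustrative config; the keys and the path are assumptions.
CONFIG_TEXT = """
{
  "data_path": "data/example.csv",
  "learning_rate": 0.0005,
  "epochs": 20,
  "preprocessing": ["drop_missing", "round_logp"]
}
"""

config = load_config(CONFIG_TEXT)
rows = [
    {"name": "compound_a", "logP": 4.81},  # made-up value
    {"name": "compound_b", "logP": None},
]
processed = run_preprocessing(config, rows)
print(config)
print(processed)
```

With a registry like this, adding a custom preprocessing step means writing one decorated function and naming it in the config, without touching the pipeline code itself.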

Section 07

Project Significance and Future Outlook of CaPFAS

CaPFAS fills a gap in specialized machine learning tooling for the PFAS field. Unlike general-purpose tools, it is optimized for PFAS data, and its open-source nature ensures transparency and auditability, both of which are critical for regulatory science. As global PFAS regulation tightens, demand for such tools is likely to keep growing. Beyond serving current research, CaPFAS lays a foundation for future PFAS knowledge graph construction and AI-assisted toxicology, making it relevant to professionals in environmental chemistry, toxicology, and related fields.