Zing Forum

Building a Feedforward Neural Network from Scratch: A Deep Learning Practice for Protein Folding State Classification

Implement a complete feedforward neural network from scratch using only NumPy, perform three-class classification (folded/intermediate/unfolded) on molecular dynamics simulation data of the Trp-cage mini-protein, and gain an in-depth understanding of the mathematical principles behind neural networks.

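Tags: feedforward neural network, protein folding, molecular dynamics, NumPy, from-scratch implementation, deep learning, Trp-cage, RMSD, ETE, bioinformatics
Published 2026-06-15 12:13 | Recent activity 2026-06-15 12:31 | Estimated read 7 min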

Section 01

Introduction: Building a Feedforward Neural Network from Scratch for Protein Folding State Classification

This project is developed and maintained by ptan123 and released on GitHub (project title: FFNN_Project; link: https://github.com/ptan123/FFNN_Project; release date: June 15, 2026). It implements a feedforward neural network from scratch using only NumPy and applies it to three-class classification (folded, intermediate, and unfolded states) of molecular dynamics simulation data for the Trp-cage mini-protein. The aim is to build an in-depth understanding of the mathematics behind neural networks.

Section 02

Project Background and Scientific Significance

Protein folding is a core problem in biochemistry: a protein's three-dimensional structure determines its function, stability, and interactions, which matters for drug design and for research into disease mechanisms. Trp-cage, a mini-protein of 20 amino acids, is a convenient model for folding research because molecular dynamics simulations can generate large amounts of conformational data for it. The difficulty lies in assigning each simulated conformation to a state. The three states are characterized as follows: the folded state has low RMSD and low ETE and is the functional state; the intermediate state has low ETE but high RMSD and is non-functional; the unfolded state has high RMSD and high ETE and is non-functional.

Section 03

Technical Objectives and Dataset Description

This project is a practice exercise for the CH610 machine learning course. Its goal is to build a fully functional feedforward neural network from scratch, using only NumPy and no high-level frameworks. The reason for implementing from scratch is that modern frameworks hide the underlying mechanics, whereas writing each piece by hand forces an understanding of the core components: forward propagation, activation functions, backpropagation, loss functions, and optimization algorithms. The dataset is Trp-cage simulation data with two features: RMSD (root-mean-square deviation, which measures how far a conformation deviates from the reference structure) and ETE (end-to-end distance, which reflects how compact the chain is).
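To make the two features concrete, here is a minimal sketch of how RMSD and ETE could be computed from atomic coordinates with NumPy. The function names `rmsd` and `end_to_end_distance` and the toy three-atom chain are illustrative assumptions, not taken from the project; the sketch also assumes each frame has already been superimposed on the reference structure, a step that real trajectory analysis normally performs first.

```python
import numpy as np

def rmsd(coords, reference):
    """Root-mean-square deviation between two (n_atoms, 3) coordinate arrays.

    Assumes the structures are already superimposed; no alignment is done here.
    """
    diff = coords - reference
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

def end_to_end_distance(coords):
    """Euclidean distance between the first and last atom of the chain."""
    return float(np.linalg.norm(coords[-1] - coords[0]))

# Toy 3-atom chain, purely illustrative (not Trp-cage data).
reference = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [2.0, 0.0, 0.0]])
frame = reference + np.array([0.0, 1.0, 0.0])  # every atom shifted by 1 along y

# Two-dimensional feature vector (RMSD, ETE) for this frame.
features = np.array([rmsd(frame, reference), end_to_end_distance(frame)])
```

Each simulation frame would yield one such (RMSD, ETE) pair, which is the 2-dimensional input the network receives.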

Section 04

Neural Network Architecture and Core Algorithms

Architecture design: the input layer receives the 2-dimensional feature vector (RMSD, ETE); the hidden layer uses the ReLU activation function, max(0, x); the output layer uses softmax to produce probabilities over the three classes. Core algorithms: forward propagation passes the input through the hidden layer and the output layer to softmax; the loss function is cross-entropy, which measures the difference between the predicted and true class distributions; labels are one-hot encoded; backpropagation computes the gradients via the chain rule; gradient descent then updates the parameters (weights and biases).
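The pipeline described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the hidden-layer width (16), He-style initialization, learning rate, epoch count, full-batch updates, and the synthetic three-cluster data standing in for the Trp-cage features are all choices made here for the example. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(labels, n_classes):
    out = np.zeros((labels.size, n_classes))
    out[np.arange(labels.size), labels] = 1.0
    return out

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def init_params(n_in=2, n_hidden=16, n_out=3):
    # He-style scaling is an assumption; the source does not state the init scheme.
    return {
        "W1": rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, np.sqrt(2.0 / n_hidden), (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def forward(p, X):
    z1 = X @ p["W1"] + p["b1"]           # hidden pre-activation
    a1 = np.maximum(0.0, z1)             # ReLU
    probs = softmax(a1 @ p["W2"] + p["b2"])
    return z1, a1, probs

def cross_entropy(probs, Y):
    # Mean over samples; the small constant guards against log(0).
    return float(-np.mean(np.sum(Y * np.log(probs + 1e-12), axis=1)))

def backward(p, X, Y, z1, a1, probs):
    n = X.shape[0]
    dz2 = (probs - Y) / n                # gradient of softmax + cross-entropy
    grads = {"W2": a1.T @ dz2, "b2": dz2.sum(axis=0)}
    dz1 = (dz2 @ p["W2"].T) * (z1 > 0)   # chain rule through ReLU
    grads["W1"] = X.T @ dz1
    grads["b1"] = dz1.sum(axis=0)
    return grads

def train(p, X, Y, lr=0.1, epochs=1000):
    losses = []
    for _ in range(epochs):
        z1, a1, probs = forward(p, X)
        losses.append(cross_entropy(probs, Y))
        grads = backward(p, X, Y, z1, a1, probs)
        for k in p:
            p[k] -= lr * grads[k]        # plain gradient descent step
    return losses

# Synthetic stand-in for (RMSD, ETE) features: three Gaussian clusters.
centers = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 2.0]])
labels = np.repeat(np.arange(3), 50)
X = centers[labels] + 0.3 * rng.normal(size=(150, 2))
Y = one_hot(labels, 3)

params = init_params()
losses = train(params, X, Y)
train_accuracy = np.mean(forward(params, X)[2].argmax(axis=1) == labels)
```

The compact form `probs - Y` for the output-layer gradient is what makes the softmax and cross-entropy pairing convenient: the two derivatives combine so that no explicit softmax Jacobian is needed.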

Section 05

Model Evaluation and Implementation Trade-offs

Evaluation methods: test accuracy is the basic metric; the learning curve shows whether training has converged or is overfitting; the confusion matrix shows which classes are mistaken for one another; decision boundary visualization displays the learned classification rule in the RMSD/ETE plane. Advantages of the NumPy implementation: it is transparent, educational, flexible, and lightweight. Challenges: there is no GPU acceleration, advanced features such as batch normalization are missing, debugging is harder, and production features such as model saving and loading are absent.
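Two of these metrics are simple to compute by hand. The sketch below shows one way to build accuracy and a confusion matrix in NumPy; the function names and the six toy labels are assumptions for illustration, with classes numbered 0 (folded), 1 (intermediate), and 2 (unfolded).

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=3):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # unbuffered add handles repeated index pairs
    return cm

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

# Toy labels, purely illustrative.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

cm = confusion_matrix(y_true, y_pred)
acc = accuracy(y_true, y_pred)
```

In this toy case the single off-diagonal entry, `cm[1, 2]`, records one intermediate sample predicted as unfolded, which is the kind of class confusion the matrix is meant to expose.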

Section 06

Scientific Value and Extension Directions

Scientific value: the project shows how machine learning and biochemistry can be combined, provides a tool for analyzing large-scale molecular simulation data, and allows the model's behavior to be checked against the physical meaning of RMSD and ETE. Extension directions: add more features (radius of gyration, contact maps), try more complex architectures (CNN, RNN), extend to larger protein systems, add uncertainty quantification (Bayesian neural networks), and apply active learning for smarter sampling.
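As an example of the first extension, radius of gyration could be added as a third input feature. The following is a minimal sketch assuming per-frame coordinates are available as an (n_atoms, 3) array; the function name, the optional `masses` argument, and the toy coordinates are hypothetical.

```python
import numpy as np

def radius_of_gyration(coords, masses=None):
    """Mass-weighted RMS distance of atoms from their center of mass.

    With masses=None every atom is weighted equally (an assumption for simplicity).
    """
    coords = np.asarray(coords, dtype=float)
    if masses is None:
        masses = np.ones(len(coords))
    masses = np.asarray(masses, dtype=float)
    center = np.average(coords, axis=0, weights=masses)
    sq_dist = ((coords - center) ** 2).sum(axis=1)
    return float(np.sqrt(np.average(sq_dist, weights=masses)))

# Four atoms at unit distance from the origin, purely illustrative.
toy = np.array([[1.0, 0.0, 0.0],
                [-1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, -1.0, 0.0]])
rg = radius_of_gyration(toy)
```

Adding such a feature would change the network's input dimension from 2 to 3, and the decision boundary could no longer be drawn in a single 2D plot.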

Section 07

Summary and Key Insights

This project is a useful teaching case that shows the value of interdisciplinary work. The key takeaway is not the classification accuracy but the understanding of how neural networks work, which is the foundation for applying machine learning sensibly in scientific research. The open-source implementation gives learners a reference. Production systems will still call for mature frameworks, but mastering the underlying principles is a necessary step toward becoming a capable machine learning practitioner.