Zing Forum

MNIST Handwritten Digit Recognition: Comparative Experimental Study of ReLU and Sigmoid Activation Functions

This study compares ReLU and Sigmoid activation functions in a neural network trained on the MNIST dataset. It shows ReLU's advantages in convergence speed and gradient propagation, and gives deep learning beginners a practical reference for choosing an activation function.

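Tags: MNIST, activation functions, ReLU, Sigmoid, neural networks, deep learning basics, vanishing gradient, handwritten digit recognition, machine learning experiments
Published 2026-06-10 00:15 | Recent activity 2026-06-10 00:20 | Estimated read 8 min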

Section 01

【Introduction】Comparative Study of ReLU and Sigmoid Activation Functions in MNIST Handwritten Digit Recognition

Original Author/Maintainer: Tayyabah-Rehman
Source Platform: GitHub
Original Project: Simple-Neural-Network-Development-for-Digit-Classification
Release Time: 2026-06-09
Original Link: https://github.com/Tayyabah-Rehman/Simple-Neural-Network-Development-for-Digit-Classification

Section 02

Project Background and Significance

In deep learning, the choice of activation function has a large effect on how well a neural network trains. Sigmoid was the common choice in early networks, but it suffers from a pronounced vanishing gradient problem as networks get deeper. The adoption of ReLU for deep networks around 2010-2011 changed this, and ReLU has since become a standard default in modern deep learning.

This project uses MNIST handwritten digit recognition (70,000 grayscale images of 28x28 pixels) as a benchmark task to compare the two activation functions. It looks beyond the difference in accuracy to the underlying differences in training dynamics and gradient propagation.

Section 03

Principle Comparison of Activation Functions: Sigmoid vs. ReLU

Sigmoid Function

Mathematical expression: \(\sigma(x) = \frac{1}{1 + e^{-x}}\), which maps any real input to the interval (0, 1). Advantages: the output can be interpreted as a probability, and the function is differentiable everywhere. Drawbacks: vanishing gradients (the derivative approaches 0 as the input moves away from the origin, so gradients can shrink exponentially with depth) and outputs that are not zero-centered (they are always positive, which can slow convergence).
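
The saturation behavior described above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `sigmoid` and `sigmoid_grad` are illustrative and are not taken from the original project.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, written so that large negative x does not overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at 0.25 for x = 0 and shrinks quickly on both sides.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.6f}  grad={sigmoid_grad(x):.6f}")
```

The printed gradients show why inputs far from the origin contribute almost nothing to weight updates.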

ReLU Function

Mathematical expression: \(f(x) = \max(0, x)\), a piecewise linear function. Advantages: cheap to compute (no exponentials), mitigates vanishing gradients (the gradient is exactly 1 for positive inputs), and produces sparse activations (negative inputs are set to zero, which may help generalization). Limitation: a neuron can "die" if its input stays negative, because its gradient is then zero and its weights stop updating; variants such as Leaky ReLU mitigate this.
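
As a rough illustration of these properties, the sketch below defines ReLU, its gradient, and a Leaky ReLU variant in plain Python. The slope `alpha=0.01` is a commonly used value chosen here as an assumption, and the gradient at exactly `x == 0` is set to 0 by convention.

```python
def relu(x: float) -> float:
    """ReLU: passes positive inputs through and zeroes out the rest."""
    return max(0.0, x)

def relu_grad(x: float) -> float:
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise (0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0

def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU: keeps a small slope alpha for negative inputs."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Gradient of Leaky ReLU: never exactly zero, so the unit can still learn."""
    return 1.0 if x > 0 else alpha

for x in (-3.0, -0.5, 0.5, 3.0):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

For negative inputs `relu_grad` returns 0, which is the mechanism behind "dead" neurons, while `leaky_relu_grad` still returns a small nonzero value.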

Section 04

Experimental Design and Network Architecture

A feedforward neural network architecture is used: the input layer receives 784-dimensional flattened image pixels, the hidden layer performs non-linear transformation, and the output layer generates a probability distribution of 10 categories.
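
To make the 784 → hidden → 10 data flow concrete, here is a minimal forward-pass sketch in plain Python. It is not the original notebook's code: the hidden size of 32, the uniform initialization, and the random stand-in image are all assumptions made for illustration.

```python
import math
import random

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_layer(n_in, n_out, rng):
    """Small uniform random weights and zero biases (an illustrative choice)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, weights, biases):
    """Affine layer: one dot product plus bias per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def softmax(logits):
    """Numerically stable softmax over the 10 class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(pixels, params, activation):
    """784 inputs -> hidden layer with the chosen activation -> 10 class probabilities."""
    (w1, b1), (w2, b2) = params
    hidden = [activation(z) for z in dense(pixels, w1, b1)]
    return softmax(dense(hidden, w2, b2))

rng = random.Random(0)
HIDDEN_UNITS = 32  # assumed size; the source does not state it
params = (init_layer(784, HIDDEN_UNITS, rng), init_layer(HIDDEN_UNITS, 10, rng))
image = [rng.random() for _ in range(784)]  # stand-in for a flattened image scaled to [0, 1]

for name, act in (("relu", relu), ("sigmoid", sigmoid)):
    probs = forward(image, params, act)
    print(name, len(probs), round(sum(probs), 6))
```

Passing the activation as an argument means the same weights and the same code path serve both variants, so only the hidden-layer non-linearity differs.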

The core of the experiment is a controlled-variable comparison: the network structure, optimizer, learning rate, batch size, and other hyperparameters are held constant, and only the hidden-layer activation function is changed, so that any performance difference can be attributed to the activation function itself. Loss curves and validation accuracy are recorded during training.
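
One way to enforce this kind of comparison in code is to derive the second configuration from the first instead of writing both by hand. The sketch below uses Python's `dataclasses`; the class name `ExperimentConfig` and every default value are hypothetical placeholders, since the source does not list the actual hyperparameters.

```python
from dataclasses import asdict, dataclass, replace

@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical experiment settings; all values are placeholders."""
    activation: str
    hidden_units: int = 128
    optimizer: str = "sgd"
    learning_rate: float = 0.01
    batch_size: int = 64
    epochs: int = 10
    seed: int = 0  # same seed so both runs start from the same initial weights

def differing_fields(a: ExperimentConfig, b: ExperimentConfig) -> list:
    """Names of the fields whose values differ between two configurations."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if da[k] != db[k])

baseline = ExperimentConfig(activation="sigmoid")
variant = replace(baseline, activation="relu")  # copy everything, change one field

print(differing_fields(baseline, variant))  # ['activation']
```

Checking `differing_fields` before training is a cheap guard against accidentally changing a second variable between runs.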

Section 05

Experimental Result Analysis: Performance Advantages of ReLU

  1. Convergence Speed: The ReLU network reaches high accuracy in fewer epochs, because its gradient of 1 for positive inputs lets errors backpropagate without shrinking; the Sigmoid network progresses more slowly and needs more iterations to converge.
  2. Gradient Flow: Vanishing gradients are pronounced in deep Sigmoid networks (the derivative is at most 0.25 per layer, so the signal attenuates sharply when layers are stacked); ReLU's unit gradient on the positive interval keeps gradient propagation stable, which makes training deeper networks practical.
  3. Final Performance: Both reach high accuracy on MNIST, but ReLU achieves the same or better accuracy in less training time; the gap is typically larger on more complex datasets or deeper networks.
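
The attenuation argument in point 2 is simple arithmetic and can be sketched as follows. This is a simplified bound that looks only at the activation-derivative factor contributed by each layer and ignores the weight matrices, which also scale the gradient in a real network.

```python
SIGMOID_MAX_GRAD = 0.25  # largest possible value of the sigmoid derivative (at x = 0)
RELU_ACTIVE_GRAD = 1.0   # ReLU derivative for any positive input

def activation_gradient_bound(per_layer_grad: float, depth: int) -> float:
    """Upper bound on the product of activation derivatives across `depth` layers."""
    return per_layer_grad ** depth

for depth in (1, 2, 5, 10):
    sig = activation_gradient_bound(SIGMOID_MAX_GRAD, depth)
    rel = activation_gradient_bound(RELU_ACTIVE_GRAD, depth)
    print(f"depth={depth:2d}  sigmoid<= {sig:.3e}  relu(active path)= {rel:.1f}")
```

Even in the best case for Sigmoid, ten stacked layers scale the activation factor down to below one millionth, while an active ReLU path leaves it at 1.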

Section 06

Practical Insights and Best Practice Recommendations

  1. Default to ReLU: Prefer ReLU for hidden layers unless there is a specific reason to use something else;
  2. Exception for the Output Layer: Use Sigmoid for binary classification and Softmax for multi-class classification;
  3. Adjust the Learning Rate: Because ReLU does not shrink gradients for positive inputs, parameter updates can be larger than with Sigmoid, so the learning rate may need to be reduced;
  4. Handle Dead ReLU: If a large number of neurons become inactive, try Leaky ReLU or adjust the initialization strategy.
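
For point 4, one simple diagnostic is to measure how many hidden units output zero for every sample in a batch. The sketch below assumes the pre-activation values have already been collected as a list of per-sample lists; the helper name `dead_unit_fraction` and the toy batch are illustrative only.

```python
def relu(x: float) -> float:
    return max(0.0, x)

def dead_unit_fraction(pre_activations) -> float:
    """Fraction of units whose ReLU output is zero for every sample in the batch.

    pre_activations: list of samples, each a list with one pre-activation per unit.
    """
    n_units = len(pre_activations[0])
    dead = sum(
        1
        for j in range(n_units)
        if all(relu(sample[j]) == 0.0 for sample in pre_activations)
    )
    return dead / n_units

# Toy batch: 3 samples, 4 hidden units. Only the second unit is negative for every sample.
batch = [
    [0.5, -1.2, -0.3, 2.0],
    [1.1, -0.7, 0.4, -0.1],
    [-0.2, -3.0, 0.9, 0.6],
]
print(dead_unit_fraction(batch))  # 0.25
```

A unit that is inactive on one batch is not necessarily dead, so in practice the check is more informative over a large sample of the data.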

Section 07

Educational Value and Learning Path

For deep learning beginners, this project is a good starting case:

  • Build intuition for how activation functions affect network behavior;
  • Learn to design experiments using the controlled-variable method;
  • Develop a hands-on feel for the vanishing gradient problem;
  • Understand why ReLU became the mainstream choice.

The Jupyter Notebook format makes it easy to run and modify the code step by step, and readers are encouraged to try different configurations to deepen their understanding.

Section 08

Conclusion: The Inevitability of ReLU Becoming Mainstream

Although this MNIST project is small, it illustrates a core design decision in deep learning. ReLU's popularity stems largely from how effectively it mitigates the vanishing gradient problem. For developers who want to understand how neural networks work in depth, reproducing this comparative experiment is a valuable learning exercise.