# MNIST Handwritten Digit Recognition: Comparative Experimental Study of ReLU and Sigmoid Activation Functions

> This study compares the performance of ReLU and Sigmoid activation functions in neural networks using the MNIST dataset, revealing ReLU's advantages in convergence speed and gradient propagation, and providing intuitive references for deep learning beginners to choose activation functions.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-06-09T16:15:48.000Z
- Last activity: 2026-06-09T16:20:41.558Z
- Popularity: 161.9
- Keywords: MNIST, activation functions, ReLU, Sigmoid, neural networks, introduction to deep learning, vanishing gradient, handwritten digit recognition, machine learning experiments
- Page link: https://www.zingnex.cn/en/forum/thread/mnist-relusigmoid
- Canonical: https://www.zingnex.cn/forum/thread/mnist-relusigmoid
- Markdown source: floors_fallback

---

## Introduction: Comparative Study of ReLU and Sigmoid Activation Functions in MNIST Handwritten Digit Recognition

- Original Author/Maintainer: Tayyabah-Rehman
- Source Platform: GitHub
- Original Project: Simple-Neural-Network-Development-for-Digit-Classification
- Release Time: 2026-06-09
- Original Link: https://github.com/Tayyabah-Rehman/Simple-Neural-Network-Development-for-Digit-Classification

## Project Background and Significance

In deep learning, the choice of activation function strongly affects neural network performance. Sigmoid was the common choice in early networks, but it suffers from a pronounced vanishing gradient problem in deep networks. The popularization of ReLU for deep networks around 2011 changed the situation, and ReLU has since become a standard default in modern deep learning.

This project uses MNIST handwritten digit recognition (70,000 grayscale images of 28x28 pixels) as a benchmark task to compare the two activation functions. It shows not only the difference in accuracy but also the underlying differences in training dynamics and gradient propagation.

## Principle Comparison of Activation Functions: Sigmoid vs. ReLU

### Sigmoid Function
Mathematical expression: \(\sigma(x) = \frac{1}{1 + e^{-x}}\), which maps any real input to the interval (0, 1). Advantages: the output can be interpreted as a probability, and the function is differentiable everywhere. Drawbacks: vanishing gradients (the derivative \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\) is at most 0.25 and approaches 0 when the input is far from the origin, so the gradient can decay exponentially with depth) and non-zero-centered output (the output is always positive, which can slow convergence).
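
To make the saturation behaviour concrete, here is a minimal pure-Python sketch of the sigmoid and its derivative. It is an illustration only, not code from the original notebook; the function names `sigmoid` and `sigmoid_grad` are chosen here for the example.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written so that exp() never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The derivative peaks at x = 0 and shrinks quickly as |x| grows.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.6f}  grad={sigmoid_grad(x):.6f}")
```

The printed derivative is largest at the origin (0.25) and falls by several orders of magnitude at |x| = 10, which is the saturation that causes vanishing gradients.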

### ReLU Function
Mathematical expression: \(f(x) = \max(0, x)\), a piecewise linear function. Advantages: efficient computation (no exponential operations), mitigation of vanishing gradients (the gradient is exactly 1 for positive inputs), and sparse activation (negative inputs are set to zero, which is often credited with helping generalization). Limitation: a neuron can "die" if its input stays negative, because its gradient is then zero and its weights stop updating; variants such as Leaky ReLU mitigate this.
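
The following sketch shows ReLU next to Leaky ReLU so the difference on negative inputs is visible. The slope `alpha=0.01` is a commonly used value assumed here for illustration; the original project does not specify one.

```python
def relu(x: float) -> float:
    """ReLU: pass positive inputs through, clamp the rest to zero."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU: keep a small slope (alpha, assumed here) for negative inputs."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Gradient of Leaky ReLU: never exactly zero, so the unit can recover."""
    return 1.0 if x > 0 else alpha


for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

For negative inputs the ReLU gradient is exactly zero, while Leaky ReLU still passes a small gradient, which is why it is suggested as a remedy for dead neurons.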

## Experimental Design and Network Architecture

A feedforward neural network is used: the input layer receives the 784-dimensional flattened image pixels (28 x 28), the hidden layer applies a non-linear transformation, and the output layer produces a probability distribution over the 10 digit classes.
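
As a rough sketch of this architecture, the forward pass below maps 784 inputs through one hidden layer to a 10-way softmax. It is written in plain Python to stay dependency-free and is not the notebook's actual code. The hidden size (`HIDDEN_UNITS = 32`), the random initialization, and the random stand-in "image" are all assumptions made for the example.

```python
import math
import random


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(values):
    # tanh form of the logistic function; avoids overflow in exp().
    return [0.5 * (1.0 + math.tanh(0.5 * v)) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Turn logits into a probability distribution (max-shifted for stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Small Gaussian weights and zero biases (an assumed initialization)."""
    scale = math.sqrt(1.0 / n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(pixels, params, activation):
    """784 inputs -> hidden layer with the chosen activation -> 10 probabilities."""
    (w1, b1), (w2, b2) = params
    hidden = activation(dense(pixels, w1, b1))
    return softmax(dense(hidden, w2, b2))


rng = random.Random(0)
HIDDEN_UNITS = 32  # hypothetical size; the source does not state it
params = (init_layer(784, HIDDEN_UNITS, rng), init_layer(HIDDEN_UNITS, 10, rng))
pixels = [rng.random() for _ in range(784)]  # stand-in for a normalized image

probs_relu = forward(pixels, params, relu)
probs_sigmoid = forward(pixels, params, sigmoid)
print(len(probs_relu), round(sum(probs_relu), 6))
```

Passing the activation as an argument means the same weights and the same input can be run through both variants, with only the hidden-layer non-linearity changing. A real experiment would use NumPy or a deep learning framework for speed.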

The core of the experiment is a controlled comparison: the network structure and hyperparameters such as optimizer, learning rate, and batch size are held constant, and only the hidden-layer activation function is changed, so that performance differences can be attributed to the activation function itself. Loss curves and validation accuracy are recorded during training.
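
One way to enforce such a controlled comparison in code is to derive the second configuration from the first, changing a single field. The sketch below assumes a hypothetical `ExperimentConfig` class; every field name and value is a placeholder, not a setting taken from the original project.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical experiment settings; all defaults are illustrative."""
    activation: str
    hidden_units: int = 128
    optimizer: str = "sgd"
    learning_rate: float = 0.01
    batch_size: int = 64
    epochs: int = 10
    seed: int = 0


relu_cfg = ExperimentConfig(activation="relu")
# Derive the second run from the first so everything else stays identical.
sigmoid_cfg = replace(relu_cfg, activation="sigmoid")


def differing_fields(a: ExperimentConfig, b: ExperimentConfig) -> list:
    """Return the names of the fields whose values differ between two configs."""
    da, db = asdict(a), asdict(b)
    return sorted(key for key in da if da[key] != db[key])


print(differing_fields(relu_cfg, sigmoid_cfg))
```

Checking that `differing_fields` returns only `activation` before training is a cheap guard against accidentally changing a second variable, such as the seed or the learning rate.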

## Experimental Result Analysis: Performance Advantages of ReLU

1. **Convergence Speed**: The ReLU network reaches high accuracy in fewer epochs because its unit gradient for positive inputs lets errors backpropagate effectively; the Sigmoid network progresses more slowly and needs more iterations to converge.
2. **Gradient Flow**: Vanishing gradients are pronounced in deep Sigmoid networks (the derivative is at most 0.25 per layer, so the gradient shrinks sharply when layers are stacked); ReLU's unit gradient on the positive interval keeps gradient propagation stable, which makes training deep networks practical.
3. **Final Performance**: Both achieve high accuracy on MNIST, but ReLU reaches the same or better accuracy in less training time; the gap is typically larger on more complex datasets or in deeper networks.
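
The 0.25-per-layer bound can be turned into a quick back-of-the-envelope calculation. The sketch below multiplies only the activation derivatives across layers and deliberately ignores the weight matrices, so it shows a best-case bound for Sigmoid, not a simulation of a real network. The function name is chosen for this example.

```python
SIGMOID_MAX_GRAD = 0.25  # maximum of s(x) * (1 - s(x)), reached at x = 0
RELU_ACTIVE_GRAD = 1.0   # ReLU derivative for any positive input


def activation_gradient_bound(per_layer_grad: float, depth: int) -> float:
    """Product of the per-layer activation derivative over `depth` layers."""
    return per_layer_grad ** depth


for depth in (1, 2, 5, 10):
    sig = activation_gradient_bound(SIGMOID_MAX_GRAD, depth)
    rel = activation_gradient_bound(RELU_ACTIVE_GRAD, depth)
    print(f"depth={depth:2d}  sigmoid<= {sig:.3e}  relu(active path)= {rel:.1f}")
```

Even in the best case, ten stacked Sigmoid layers scale the gradient by less than one millionth, while an active ReLU path leaves this factor at 1.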

## Practical Insights and Best Practice Recommendations

1. **Default to ReLU**: Prefer ReLU for hidden layers unless there is a specific reason not to;
2. **Exception for the Output Layer**: Use Sigmoid for binary classification and Softmax for multi-class classification;
3. **Adjust the Learning Rate**: ReLU's unit gradient can produce larger parameter updates than Sigmoid, so the learning rate may need to be reduced;
4. **Handle Dead ReLU**: If a large share of neurons stop activating, try Leaky ReLU or a different initialization strategy.
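
To decide whether dead ReLU units are actually a problem, one simple diagnostic is the fraction of hidden units that never produce a positive pre-activation over a batch. The helper below is a hypothetical sketch with hand-made numbers, not output from the original experiment.

```python
def dead_unit_fraction(pre_activations):
    """Fraction of hidden units whose pre-activation is <= 0 for every sample.

    `pre_activations` is a list of samples, each a list with one value per
    hidden unit (the values fed into ReLU).
    """
    n_units = len(pre_activations[0])
    dead = sum(
        1
        for unit in range(n_units)
        if all(sample[unit] <= 0.0 for sample in pre_activations)
    )
    return dead / n_units


# Toy batch: 3 samples, 4 hidden units. Only unit index 1 is never positive.
batch = [
    [0.5, -1.2, -0.3, 2.0],
    [1.1, -0.7, 0.4, -0.1],
    [-0.2, -2.5, 0.9, 0.3],
]
print(dead_unit_fraction(batch))  # prints 0.25
```

In practice the check is only meaningful over a large, representative batch; a unit that is inactive on a few samples is simply sparse, not dead.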

## Educational Value and Learning Path

For deep learning beginners, this project is a good starting case:
- Build an intuitive understanding of how activation functions affect network behavior;
- Learn the controlled-variable approach to experimental design;
- Develop a feel for the vanishing gradient problem;
- Understand why ReLU became the mainstream choice.

The Jupyter Notebook format makes it easy to run and modify the code step by step, and readers are encouraged to try different configurations to deepen their understanding.

## Conclusion: The Inevitability of ReLU Becoming Mainstream

Although the MNIST project is small, it illustrates a core design decision in deep learning. ReLU's popularity stems largely from how effectively it mitigates the vanishing gradient problem. For developers who want to understand how neural networks work in depth, reproducing this comparative experiment is a valuable learning exercise.