Deep Understanding of Neural Network Activation Functions: Re-examining Initialization and Optimization from the Gradient Flow Perspective

This article presents a systematic experimental study of neural network activation functions. Using a multi-layer perceptron (MLP) implemented purely in NumPy, it analyzes the gradient flow dynamics, saturation phenomena, and optimization behavior of ReLU, tanh, arctan, and softsign under different Xavier initialization scales, showing why activation function selection should be evaluated together with the initialization strategy.

Tags: neural networks, activation functions, gradient flow, Xavier initialization, ReLU, tanh, deep learning, optimization dynamics, NumPy, machine learning
Published 2026-04-30 09:45 · Recent activity 2026-04-30 10:24 · Estimated read 5 min

Section 01

Introduction: A Study on Joint Evaluation of Activation Functions and Initialization Strategies

Using an MLP implemented purely in NumPy, this article systematically analyzes the gradient flow dynamics, saturation phenomena, and optimization behavior of four activation functions (ReLU, tanh, arctan, and softsign) under different Xavier initialization scales. It shows that activation function selection should be evaluated jointly with the initialization strategy, rather than judged solely by final accuracy.

Section 02

Research Background and Motivation

Activation functions are traditionally evaluated on final accuracy alone, which ignores differences in optimization dynamics across initialization conditions: ReLU can suffer from dead neurons, while tanh can converge slowly. Gradient flow is the lifeline of training, and the derivative characteristics of an activation function directly determine how healthy that flow is. It is therefore necessary to understand gradient flow dynamics under different conditions.

Section 03

Experimental Design Details

The MLP is implemented entirely in NumPy, which gives full control over the pipeline, forces a first-principles understanding, and allows fine-grained monitoring of internal states. The experiment is a noisy XOR classification task, comparing four activation functions (ReLU, tanh, arctan, softsign) crossed with three Xavier initialization scales (0.5, 1.0, 2.0). Each combination is repeated 10 times for statistical reliability.
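
The article does not reproduce its code here, so the following is a minimal sketch of the ingredients it describes: the four activations with their derivatives, Xavier initialization with an extra scale multiplier, and a noisy XOR data generator. All names, the noise level, and the data-generation details are illustrative assumptions, not the article's actual implementation.

```python
import numpy as np

# The four activations compared in the article, paired with their derivatives.
ACTIVATIONS = {
    "relu":     (lambda z: np.maximum(0.0, z),    lambda z: (z > 0).astype(float)),
    "tanh":     (np.tanh,                         lambda z: 1.0 - np.tanh(z) ** 2),
    "arctan":   (np.arctan,                       lambda z: 1.0 / (1.0 + z ** 2)),
    "softsign": (lambda z: z / (1.0 + np.abs(z)), lambda z: 1.0 / (1.0 + np.abs(z)) ** 2),
}

def xavier_init(fan_in, fan_out, scale, rng):
    """Xavier/Glorot initialization with an extra scale multiplier (0.5, 1.0, or 2.0)."""
    std = scale * np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def make_noisy_xor(n, noise=0.2, rng=None):
    """Noisy XOR classification task: 2-D inputs, binary labels."""
    rng = rng if rng is not None else np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(float)
    X = X + rng.normal(0.0, noise, size=X.shape)
    return X, y
```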

Section 04

Key Findings: Critical Impact of Initialization Scales

Under small-scale initialization (0.5), ReLU converges reliably, while the bounded functions learn slowly and show high variance across seeds. At medium and large scales (1.0/2.0), final accuracy is similar across functions, but tanh maintains healthier gradient flow at the large scale. The initialization scale strongly affects saturation and dead-neuron behavior: bounded functions saturate readily at large scales, and the functions respond to scale in markedly different ways.
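
As a rough illustration of how the initialization scale drives these effects, one could probe a freshly initialized hidden layer for dead ReLU units and saturated tanh units. The probing procedure and thresholds below are assumptions, not the article's actual measurements; the sketch reuses the helpers from Section 03.

```python
# Rough probes for the dead-neuron and saturation effects described above.
def dead_relu_rate(Z):
    """Fraction of hidden units whose pre-activation is non-positive for every sample."""
    return float(np.mean(np.all(Z <= 0, axis=0)))

def tanh_saturation_rate(Z, thresh=0.95):
    """Fraction of activations pushed into tanh's flat tails, where the derivative is tiny."""
    return float(np.mean(np.abs(np.tanh(Z)) > thresh))

rng = np.random.default_rng(0)
X, _ = make_noisy_xor(200, rng=rng)
for scale in (0.5, 1.0, 2.0):
    Z = X @ xavier_init(2, 16, scale, rng)   # pre-activations of a 16-unit hidden layer
    print(f"scale={scale}: dead ReLU {dead_relu_rate(Z):.2f}, "
          f"tanh saturated {tanh_saturation_rate(Z):.2f}")
```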

Section 05

Diagnostic Indicator System

The project defines a comprehensive set of diagnostic indicators: derivative indicators (e.g., dphi_small_rate, dphi_mean_abs), pre-activation indicators (e.g., z_mean_abs, z_near0_rate), and gradient statistics (e.g., grad_norm_L1/L2). Together they expose the internal state of training and help identify issues such as vanishing gradients.
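
A minimal sketch of how these indicators could be computed for one hidden layer from a single forward/backward pass: the indicator names mirror those listed above, while the thresholds and the function signature are assumptions.

```python
import numpy as np

def layer_diagnostics(Z, dphi, grad_W, small=1e-2, near0=1e-2):
    """Per-layer diagnostics. Z: pre-activations, dphi: activation derivatives
    evaluated at Z, grad_W: gradient of the loss w.r.t. this layer's weights."""
    return {
        # Derivative indicators: how often the activation derivative is tiny
        # (a direct signal of vanishing gradients), and its average magnitude.
        "dphi_small_rate": float(np.mean(np.abs(dphi) < small)),
        "dphi_mean_abs":   float(np.mean(np.abs(dphi))),
        # Pre-activation indicators: how far from zero the layer operates,
        # and how much of the batch sits near zero.
        "z_mean_abs":      float(np.mean(np.abs(Z))),
        "z_near0_rate":    float(np.mean(np.abs(Z) < near0)),
        # Gradient statistics on this layer's weight gradient.
        "grad_norm_L1":    float(np.sum(np.abs(grad_W))),
        "grad_norm_L2":    float(np.linalg.norm(grad_W)),
    }
```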

Section 06

Practical Insights and Recommendations

1. Evaluate activation functions and initialization strategies jointly.
2. Focus on optimization dynamics, not just end-point accuracy.
3. Choose by scenario: ReLU for fast convergence at small scales, tanh for stable gradient flow at large scales.
4. Implement comprehensive monitoring of gradient norms, activation distributions, and related statistics (a minimal logging sketch follows below).
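
For item 4, a per-epoch logging hook might look like the following; the tracked quantities and names are illustrative assumptions rather than the article's logging code.

```python
import numpy as np

def monitor(epoch, A, grad_W, history):
    """Append per-epoch statistics: gradient norm plus a coarse summary of the
    hidden-activation distribution (A: activations, grad_W: weight gradient)."""
    history.append({
        "epoch": epoch,
        "grad_norm_L2": float(np.linalg.norm(grad_W)),
        "act_p05":    float(np.percentile(A, 5)),
        "act_median": float(np.percentile(A, 50)),
        "act_p95":    float(np.percentile(A, 95)),
    })
    return history
```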

Section 07

Conclusions and Future Directions

Conclusions: activation functions must be evaluated jointly with initialization, with attention to gradient flow and optimization dynamics. Limitations: a shallow network, a simple task, and full-batch gradient descent. Future work can extend to deeper networks, more complex datasets (e.g., MNIST), and other optimizers (e.g., Adam).