Forward Propagation
Data flows from the input layer to the output layer one layer at a time. Each layer applies a matrix operation (a weighted sum of its inputs plus a bias) followed by an activation function. For classification tasks, the output layer often uses Softmax to convert its raw scores into a probability distribution.
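The forward pass described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's actual code: the names relu, softmax, forward and params are hypothetical, the layer sizes are arbitrary, and ReLU is assumed as the hidden activation because the text does not name one.

```python
import numpy as np

def relu(z):
    # Element-wise ReLU (assumed hidden activation; the text does not specify one)
    return np.maximum(0.0, z)

def softmax(z):
    # Subtract the row-wise maximum before exponentiating for numerical stability
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def forward(x, params):
    """x has shape (N, d_in); params is a list of (W, b) pairs, one per layer."""
    a = x
    for W, b in params[:-1]:
        a = relu(a @ W + b)           # hidden layer: matrix product, bias, activation
    W_out, b_out = params[-1]
    return softmax(a @ W_out + b_out)  # output layer: raw scores to probabilities

# Hypothetical network: 4 inputs, 8 hidden units, 3 classes
rng = np.random.default_rng(0)
params = [
    (rng.normal(scale=0.1, size=(4, 8)), np.zeros(8)),
    (rng.normal(scale=0.1, size=(8, 3)), np.zeros(3)),
]
x = rng.normal(size=(5, 4))            # batch of 5 samples
probs = forward(x, params)
print(probs.shape)                     # (5, 3)
print(probs.sum(axis=1))               # each row sums to 1 (up to rounding)
```

Each row of probs is one sample's class distribution, which is what the loss function consumes in the next step.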
Loss Function
Cross-entropy loss measures the difference between the predicted probability distribution and the true labels. A prediction that is both wrong and highly confident produces a large loss, and therefore a strong correction signal.
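To make the "confident and wrong" behaviour concrete, the sketch below computes cross-entropy for two hand-picked predictions. It assumes one-hot targets and a loss averaged over the batch; the function name cross_entropy and the eps clipping constant are illustrative choices, not taken from the original.

```python
import numpy as np

def cross_entropy(probs, targets, eps=1e-12):
    # probs and targets both have shape (N, C); targets are assumed one-hot.
    # Clipping avoids log(0) when a predicted probability is exactly zero.
    clipped = np.clip(probs, eps, 1.0)
    return float(-np.sum(targets * np.log(clipped)) / probs.shape[0])

target = np.array([[1.0, 0.0, 0.0]])             # true class is index 0
confident_right = np.array([[0.9, 0.05, 0.05]])  # high probability on the true class
confident_wrong = np.array([[0.05, 0.9, 0.05]])  # high probability on a wrong class

loss_right = cross_entropy(confident_right, target)  # -ln(0.9), roughly 0.105
loss_wrong = cross_entropy(confident_wrong, target)  # -ln(0.05), roughly 3.0
print(loss_right, loss_wrong)
```

The wrong prediction is penalised far more heavily than the right one is rewarded, because the loss grows without bound as the probability assigned to the true class approaches zero.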
Backpropagation
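The chain rule gives the gradient of the loss with respect to every parameter. Each layer computes the gradients for its weights and biases, and passes the gradient with respect to its inputs back to the previous layer. When Softmax is combined with cross-entropy, the derivative with respect to the pre-activation outputs simplifies to ∂L/∂z_i = (p_i − t_i) / N, where p_i is the predicted probability, t_i is the one-hot target, and N is the number of samples the loss is averaged over.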
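The simplified gradient can be checked numerically. The sketch below applies it to a single dense layer followed by Softmax and cross-entropy, then compares one entry of the analytic weight gradient against a central finite difference. The names dense_backward and loss_fn, the shapes, and the random data are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def loss_fn(z, t):
    # Cross-entropy averaged over the batch, computed from logits z
    return -np.sum(t * np.log(softmax(z))) / z.shape[0]

def dense_backward(x, W, dz):
    # Chain rule for z = x @ W + b, given dz = dL/dz
    dW = x.T @ dz          # gradient for the weights
    db = dz.sum(axis=0)    # gradient for the biases
    dx = dz @ W.T          # gradient passed back to the previous layer
    return dW, db, dx

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))        # N = 4 samples, 5 features
W = rng.normal(size=(5, 3))        # 3 classes
b = np.zeros(3)
t = np.eye(3)[[0, 2, 1, 0]]        # one-hot targets

z = x @ W + b
dz = (softmax(z) - t) / x.shape[0]  # the simplified Softmax + cross-entropy gradient
dW, db, dx = dense_backward(x, W, dz)

# Central finite difference on a single weight as an independent check
h = 1e-6
W_plus = W.copy()
W_plus[1, 2] += h
W_minus = W.copy()
W_minus[1, 2] -= h
numeric = (loss_fn(x @ W_plus + b, t) - loss_fn(x @ W_minus + b, t)) / (2 * h)
print(abs(numeric - dW[1, 2]))      # should be very close to zero
```

The same dense_backward pattern repeats for every layer: the dx returned by one layer becomes the incoming gradient for the layer before it, after passing through that layer's activation derivative.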
Gradient Descent
Parameters are updated as W_new = W_old − η · ∇_W L, where η is the learning rate. The learning rate must be chosen carefully: a value that is too large causes oscillation or divergence, and a value that is too small makes training slow. Batch training is supported, since averaging the gradient over a batch of samples stabilizes the gradient estimate.
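The effect of the learning rate is easy to see on a toy problem. The sketch below applies the update rule to the one-dimensional loss L(w) = w², whose gradient is 2w, with three learning rates. It also includes a small helper for drawing shuffled mini-batches. The names sgd_step, run and minibatches, the toy loss, and the batch size are all hypothetical; the original does not say how batches are formed.

```python
import numpy as np

def sgd_step(W, grad, lr):
    # W_new = W_old - lr * grad
    return W - lr * grad

def run(lr, steps=50, w0=1.0):
    # Minimise L(w) = w**2 starting from w0; return the final distance from the optimum
    w = np.array([w0])
    for _ in range(steps):
        grad = 2.0 * w
        w = sgd_step(w, grad, lr)
    return float(abs(w[0]))

def minibatches(X, y, batch_size, rng):
    # Yield shuffled (X_batch, y_batch) pairs; the last batch may be smaller
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

final_small = run(lr=0.001)  # too small: barely moves in 50 steps
final_good = run(lr=0.1)     # reasonable: converges toward 0
final_large = run(lr=1.1)    # too large: overshoots and diverges
print(final_small, final_good, final_large)

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
batches = list(minibatches(X, y, batch_size=4, rng=rng))
print([len(xb) for xb, _ in batches])  # [4, 4, 2]
```

On this toy loss each step multiplies w by (1 − 2η), so the behaviour is governed entirely by whether that factor has magnitude below one. Real networks have no such closed form, which is why the learning rate is usually tuned empirically.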