Concept

The Problem with Constant Initialization

If all parameters of a hidden layer are initialized to a constant, such as $\mathbf{W}^{(1)} = c$, every hidden unit will receive the same inputs and parameters, producing identical activations during forward propagation. Consequently, during backpropagation, the gradients of the output with respect to the parameters $\mathbf{W}^{(1)}$ will all take the exact same value. Because gradient-based algorithms like minibatch stochastic gradient descent update the parameters using these uniform gradients, all elements of $\mathbf{W}^{(1)}$ will continue to have identical values after every iteration. The hidden layer will thus behave as if it has only a single unit, failing to realize the network's expressive power.
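
A minimal PyTorch sketch of this symmetry problem (the layer sizes, the constant 0.5, the learning rate, and the random minibatch are arbitrary choices for illustration, and both layers are constant-initialized here so the argument applies to the whole network): the hidden activations and the gradient of $\mathbf{W}^{(1)}$ come out identical for every hidden unit, and a manual SGD step leaves all rows of $\mathbf{W}^{(1)}$ equal.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Tiny MLP: 4 inputs, 8 hidden units, 1 output (sizes chosen arbitrarily).
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# Constant initialization: every weight element gets the same value.
for layer in net:
    if isinstance(layer, nn.Linear):
        nn.init.constant_(layer.weight, 0.5)
        nn.init.zeros_(layer.bias)

X = torch.randn(32, 4)          # made-up minibatch
loss = net(X).sum()
loss.backward()

print(net[0](X)[0])             # all 8 hidden pre-activations are identical
print(net[0].weight.grad)       # all 8 rows of the gradient are identical

# One SGD step applied to W^(1): every row is updated by the same amount,
# so the rows stay identical and the symmetry is never broken.
with torch.no_grad():
    net[0].weight -= 0.1 * net[0].weight.grad
print(net[0].weight)            # still a matrix of identical rows
```

Random (e.g., Gaussian) initialization gives each hidden unit a distinct starting point, so the gradients differ and the units can specialize.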

Updated 2026-05-06

Tags

Data Science

D2L

Dive into Deep Learning @ D2L