Learn Before
Concept
Dropout Regularization for Symmetry Breaking
When a neural network's hidden-layer parameters are initialized to the same constant, every hidden unit computes the same function and receives the same gradient, so standard gradient-based algorithms, such as minibatch stochastic gradient descent, update the parameters identically and can never break this symmetry on their own. Dropout regularization, however, can break the symmetry: because each unit is zeroed out independently at random, different units receive different gradients on different minibatches, their parameters diverge, and the network can eventually realize its full expressive power.
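A minimal NumPy sketch of this effect (the tiny two-layer ReLU network, the constant 0.1 initialization, and the squared loss are illustrative assumptions, not from the source): without dropout, the constant-initialized hidden rows receive identical gradients and stay identical; with inverted dropout, random per-unit masks give the rows different updates and the symmetry breaks.

```python
import numpy as np

def sgd_step(W1, w2, x, y, lr, mask):
    """One SGD step on a 2-layer ReLU net with squared loss and a dropout mask."""
    z = W1 @ x                       # hidden pre-activations
    h = np.maximum(z, 0) * mask      # ReLU, then (inverted) dropout scaling
    err = w2 @ h - y                 # residual of the squared loss
    g_w2 = err * h                   # gradient w.r.t. output weights
    g_z = err * w2 * mask * (z > 0)  # backprop through dropout and ReLU
    return W1 - lr * np.outer(g_z, x), w2 - lr * g_w2

H, x, y, lr = 4, np.array([1.0, 2.0]), 3.0, 0.1
const_W1, const_w2 = np.full((H, 2), 0.1), np.full(H, 0.1)  # constant init

# Without dropout: identical rows receive identical gradients forever.
W1_plain, w2_plain = const_W1.copy(), const_w2.copy()
for _ in range(20):
    W1_plain, w2_plain = sgd_step(W1_plain, w2_plain, x, y, lr, np.ones(H))

# With dropout (p = 0.5): random masks give each unit a different update.
rng = np.random.default_rng(0)
W1_drop, w2_drop = const_W1.copy(), const_w2.copy()
for _ in range(20):
    mask = (rng.random(H) < 0.5) / 0.5   # inverted dropout: keep and rescale
    W1_drop, w2_drop = sgd_step(W1_drop, w2_drop, x, y, lr, mask)

print(np.allclose(W1_plain[0], W1_plain[1]))  # True: symmetry intact
print(np.allclose(W1_drop[0], W1_drop[1]))    # False: symmetry broken
```

The key line is the mask: as soon as one unit is dropped while another is kept on some step, their rows receive different gradients and never re-merge.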
Updated 2026-05-06
Tags
D2L
Dive into Deep Learning @ D2L