When gradients explode during neural network training, the gradient norm $$\|\mathbf{g}\|$$ becomes excessively large. In such worst-case scenarios, a single gradient step can undo the progress made over the course of thousands of training iterations. Consequently, training often diverges, entirely failing to reduce the value of the objective function. Even in cases where training eventually converges, the process remains highly unstable due to massive spikes in the loss.

Claude

In a neural network with many layers or time steps, the gradient at an early layer is a product of terms contributed by all the later layers, which makes it inherently unstable. When those terms are small, the product shrinks until the gradient is too small to update the early weights in any meaningful way; this is the vanishing gradient problem. The exploding gradient problem is the opposite: the product grows, the weight updates from gradient descent become so large that the whole network turns unstable, and the values can eventually overflow numerically.

Vanishing/exploding gradient

Dive into Deep Learning

- Identity RNN with ReLU activation (solving vanishing gradient problem only)
- Gradient clipping
- Skip connections
- LSTM
- GRU

Solutions for vanishing/exploding gradient
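
Of the remedies listed, gradient clipping is the simplest to show in a few lines. The sketch below rescales a gradient whose L2 norm exceeds a threshold. The function name `clip_by_norm` and the threshold `max_norm=5.0` are illustrative assumptions, not taken from any particular library.

```python
import math

def clip_by_norm(grad, max_norm):
    """Return grad rescaled so that its L2 norm is at most max_norm.

    Hypothetical helper for illustration; grad is a flat list of floats.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)          # small gradients pass through unchanged
    scale = max_norm / norm        # shrink the length, keep the direction
    return [g * scale for g in grad]

# An "exploded" gradient with norm 500 is scaled down to norm 5.
clipped = clip_by_norm([300.0, -400.0], max_norm=5.0)
clipped_norm = math.sqrt(sum(g * g for g in clipped))
print(clipped, clipped_norm)
```

Clipping by norm keeps the direction of the update and only limits its length, so one oversized step cannot wipe out earlier progress.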

A webpage explaining exploding gradients

https://machinelearningmastery.com/exploding-gradients-in-neural-networks/

A Gentle Introduction to Exploding Gradients in Neural Networks

If the weight matrix $$\mathbf{W}$$ of a neural network layer is initialized to all zeros, the gradient of the loss $$\mathcal{L}$$ with respect to the pre-activation vector, $$\frac{\partial\mathcal{L}}{\partial\mathbf{z}}$$, will be identical for every neuron in that layer (assuming identical biases). During gradient descent, these parameters will update identically, preventing the neurons from learning distinct features and causing the symmetry problem. Furthermore, because the weight matrix is zero, backpropagating the gradient to earlier layers involves multiplication by $$\mathbf{W}^T$$, which immediately zeroes out those gradients and contributes directly to the vanishing gradient problem.

Zero Weight Initialization in Feed-Forward Networks
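
The effect can be checked numerically with a tiny hand-written network. This is a minimal sketch under assumed details that are not in the text above: one hidden layer of three `tanh` neurons, a single linear output, a squared-error loss, and NumPy for the arithmetic.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # one input example with 4 features
y = 1.0                           # scalar target (arbitrary)

# All-zero initialization of both layers.
W1, b1 = np.zeros((3, 4)), np.zeros(3)
W2, b2 = np.zeros((1, 3)), np.zeros(1)

# Forward pass.
z1 = W1 @ x + b1                  # hidden pre-activations
h = np.tanh(z1)                   # hidden activations
out = W2 @ h + b2                 # network output
loss = 0.5 * float((out[0] - y) ** 2)

# Backward pass, written out by hand.
d_out = out - y                   # dL/d(out)
dW2 = np.outer(d_out, h)          # dL/dW2
dh = W2.T @ d_out                 # gradient pushed back through W2^T
dz1 = dh * (1.0 - h ** 2)         # dL/dz1 for the hidden neurons
dW1 = np.outer(dz1, x)            # dL/dW1
dx = W1.T @ dz1                   # gradient pushed back through W1^T

print("dL/dz1:", dz1)             # the same value for every hidden neuron
print("dL/dW1 rows identical:", np.allclose(dW1, dW1[0]))
print("dL/dx:", dx)               # zeroed out by multiplying with W1^T
```

Every hidden neuron receives the same gradient, and multiplying by the zero matrices wipes out whatever would flow to earlier layers. This is why weights are usually initialized with small random values instead.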

Impact of Exploding Gradients on Model Training

When minimizing an objective function using the hyperbolic tangent ($$\tanh$$) activation function, optimization can stall due to the vanishing gradient problem. For example, if an algorithm attempts to minimize $$f(x) = \tanh(x)$$ starting at $$x = 4$$, the gradient is extremely small. Since the derivative is $$f'(x) = 1 - \tanh^2(x)$$, the gradient evaluates to $$f'(4) \approx 0.0013$$. Consequently, the optimization process gets stuck and makes negligible progress for a long time. This severe saturation issue is one of the primary reasons training deep learning models was notoriously tricky before the widespread adoption of the ReLU activation function.

Vanishing Gradient of the Tanh Activation Function
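
The stall is easy to reproduce with plain gradient descent. In the sketch below, the learning rate `lr = 0.1` and the 100-step budget are assumptions chosen for illustration.

```python
import math

def grad(x):
    """Derivative of f(x) = tanh(x)."""
    return 1.0 - math.tanh(x) ** 2

x = 4.0
lr = 0.1                      # assumed learning rate
start_grad = grad(x)          # roughly 0.0013, as computed above

for _ in range(100):          # assumed step budget
    x -= lr * grad(x)         # plain gradient descent update

print(f"gradient at x=4: {start_grad:.4f}")
print(f"x after 100 steps: {x:.4f}")
```

After 100 updates, `x` has moved by less than 0.02 from its starting point, because each step is scaled by a gradient of about 0.0013.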

When optimization in deep neural networks stalls—frequently a consequence of vanishing gradients—a common mitigation strategy is reparametrization. This approach involves altering the mathematical formulation of the problem to create a more favorable loss landscape, thereby allowing optimization algorithms to resume making progress.

Reparametrization to Mitigate Stalling Optimization

Vanishing and exploding gradients are common problems in recurrent neural networks. Consider a network where an input is multiplied by a weight matrix $$ \mathbf{W} $$ at each of $$ t $$ time steps. Let $$ \mathbf{W} $$ have the eigendecomposition $$ \mathbf{V} \Lambda \mathbf{V}^{-1} $$, where $$ \Lambda $$ is a diagonal matrix of eigenvalues, so that $$ \mathbf{W}^t = \mathbf{V} \Lambda^t \mathbf{V}^{-1} $$. If an eigenvalue has magnitude $$ |\lambda| > 1 $$, the corresponding entry $$ \lambda^t $$ grows without bound as $$ t $$ gets large, leading to an exploding gradient. Conversely, if $$ |\lambda| < 1 $$, $$ \lambda^t $$ approaches $$ 0 $$ as $$ t $$ gets large, resulting in a vanishing gradient.
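
The eigenvalue argument can be checked numerically. The sketch below builds two 2×2 matrices from a chosen eigenvector basis and compares the norm of their powers. The eigenvalues (1.1, 0.9, 0.5), the basis `V`, and the helper name `power_norm` are all assumptions made for illustration.

```python
import numpy as np

def power_norm(W, t):
    """Frobenius norm of W raised to the t-th power."""
    return np.linalg.norm(np.linalg.matrix_power(W, t))

# Assumed eigenvector basis (any invertible matrix would do).
V = np.array([[1.0, 1.0],
              [0.0, 1.0]])
V_inv = np.linalg.inv(V)

# W = V diag(eigenvalues) V^{-1}
W_explode = V @ np.diag([1.1, 0.5]) @ V_inv   # largest |eigenvalue| > 1
W_vanish = V @ np.diag([0.9, 0.5]) @ V_inv    # all |eigenvalues| < 1

for t in (1, 10, 100):
    print(t, power_norm(W_explode, t), power_norm(W_vanish, t))
```

Even with eigenvalues as close to 1 as 1.1 and 0.9, a hundred repeated multiplications separate the two cases by many orders of magnitude.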
