Vanishing Gradient of the Tanh Activation Function
When minimizing an objective function that involves the hyperbolic tangent ($\tanh$) activation function, optimization can stall because of the vanishing gradient problem. For example, if an algorithm attempts to minimize $f(x) = \tanh(x)$ starting at $x = 4$, the gradient is extremely small. Since the derivative is $f'(x) = 1 - \tanh^2(x)$, the gradient evaluates to $f'(4) \approx 0.0013$. Consequently, each update barely moves $x$, and the optimization process makes negligible progress for a long time. This saturation issue is one of the reasons training deep learning models was notoriously tricky before the widespread adoption of the ReLU activation function.
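The stall can be checked numerically. The following is a minimal sketch, not a reference implementation: it assumes plain gradient descent on $f(x) = \tanh(x)$, and the starting point `4.0`, the learning rate `0.1`, and the step count `100` are illustrative choices. The helper names `tanh_grad` and `gradient_descent` are hypothetical.

```python
import math


def tanh_grad(x):
    """Derivative of tanh at x, using the identity 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def gradient_descent(x0, lr, steps):
    """Run plain gradient descent on f(x) = tanh(x) and return every iterate."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - lr * tanh_grad(xs[-1]))
    return xs


# Assumed setup for illustration: start deep in the saturated region.
start = 4.0
trajectory = gradient_descent(start, lr=0.1, steps=100)

print(f"gradient at start: {tanh_grad(start):.4f}")
print(f"gradient at 0:     {tanh_grad(0.0):.4f}")
print(f"x after {len(trajectory) - 1} steps: {trajectory[-1]:.4f}")
```

Under these assumptions the iterate moves by only a small fraction of a unit after 100 updates, whereas the gradient at $x = 0$ equals 1, so the same learning rate would make rapid progress in the unsaturated region.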
Tags
D2L
Dive into Deep Learning @ D2L
Related
Solutions for vanishing/exploding gradient
A Gentle Introduction to Exploding Gradients in Neural Networks
Zero Weight Initialization in Feed-Forward Networks
Impact of Exploding Gradients on Model Training
Reparametrization to Mitigate Stalling Optimization
Mathematical Mechanism of Vanishing and Exploding Gradients in Recurrent Neural Networks