Intuition behind Gradient Descent with Momentum
The core intuition behind momentum is that averaging past gradients produces smoother, more effective updates. In a direction where successive gradients keep pointing the same way (such as the shallow, long-axis direction of an elongated quadratic), the gradients are well aligned, so their running average preserves a large step in that direction and accelerates progress toward the minimum. Conversely, in a direction where the gradient flips between positive and negative values on consecutive steps (such as the steep, short-axis direction of the same elongated quadratic), the averaged gradient becomes small because the opposing contributions largely cancel. This selective damping of oscillations and amplification of consistent descent is what allows momentum to navigate ill-conditioned landscapes far more efficiently than plain gradient descent.
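The cancellation argument can be checked numerically with a minimal sketch. It assumes the common velocity recursion v = β·v + g (other formulations scale the gradient by 1 − β), and the function name `accumulate_velocity`, the unit-magnitude gradient sequences, and β = 0.9 are all illustrative choices rather than anything taken from the text above.

```python
def accumulate_velocity(gradients, beta=0.9):
    """Return the final velocity after applying v = beta * v + g for each g.

    Assumed momentum form: an exponentially weighted sum of past gradients.
    """
    v = 0.0
    for g in gradients:
        v = beta * v + g
    return v


steps = 50

# Consistent direction: every gradient has the same sign (illustrative data).
consistent = [1.0] * steps

# Oscillating direction: the gradient flips sign on every step (illustrative data).
oscillating = [(-1.0) ** k for k in range(steps)]

v_consistent = accumulate_velocity(consistent)
v_oscillating = accumulate_velocity(oscillating)

print(f"consistent gradients  -> |v| = {abs(v_consistent):.3f}")
print(f"oscillating gradients -> |v| = {abs(v_oscillating):.3f}")
```

Under this recursion the velocity for aligned unit gradients approaches 1/(1 − β) = 10, while for alternating unit gradients its magnitude approaches 1/(1 + β) ≈ 0.53, so the same β amplifies the consistent direction and suppresses the oscillating one.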
Tags
Data Science
D2L
Dive into Deep Learning @ D2L
Related
Intuition behind Gradient Descent with Momentum
These plots were generated with gradient descent, gradient descent with momentum (β = 0.5), and gradient descent with momentum (β = 0.9). Which curve corresponds to which algorithm?
Adam (Deep Learning Optimization Algorithm)
Origin of the Momentum Method
Velocity Initialization in Momentum Method
Momentum Convergence on a Scalar Quadratic
Gradient Descent with Momentum Pseudocode