1Cademy - Intuition behind Gradient Descent with Momentum

The core intuition behind momentum is that replacing the current gradient with a running (exponentially weighted) average of past gradients produces smoother, more effective updates. In a direction where successive gradients keep the same sign (such as the shallow $x_1$ direction on an elongated quadratic), the gradients are well aligned, so their running average preserves a large step in that direction and accelerates progress toward the minimum. Conversely, in a direction where the gradient flips between positive and negative values on consecutive steps (such as the steep $x_2$ direction on the same elongated quadratic), the averaged gradient becomes small because the opposing contributions largely cancel. This selective damping of oscillations and reinforcement of consistent descent is what lets momentum handle ill-conditioned landscapes more efficiently than plain gradient descent, which must keep its learning rate small enough for the steepest direction.
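
The effect can be checked numerically with a minimal sketch. Everything specific here is an assumption chosen for illustration and does not come from the text above: the objective $f(x_1, x_2) = 0.1 x_1^2 + 2 x_2^2$, the starting point, the learning rate `eta`, the momentum coefficient `beta`, and the update form $v \leftarrow \beta v + g$, $x \leftarrow x - \eta v$ (one common way to write momentum). Setting `beta = 0` recovers plain gradient descent.

```python
def grad(x1, x2):
    # Gradient of the assumed objective f(x1, x2) = 0.1 * x1**2 + 2 * x2**2.
    return 0.2 * x1, 4.0 * x2


def run(eta, beta, steps=20, start=(-5.0, -2.0)):
    """Run gradient descent with momentum; beta = 0 gives plain gradient descent."""
    x1, x2 = start
    v1, v2 = 0.0, 0.0  # running (leaky) sums of past gradients
    path = [(x1, x2)]
    for _ in range(steps):
        g1, g2 = grad(x1, x2)
        # Accumulate: aligned gradients add up, alternating ones cancel.
        v1 = beta * v1 + g1
        v2 = beta * v2 + g2
        x1 -= eta * v1
        x2 -= eta * v2
        path.append((x1, x2))
    return path


# Same (deliberately large) learning rate for both runs.
plain = run(eta=0.6, beta=0.0)
momentum = run(eta=0.6, beta=0.5)

print("plain    final point:", plain[-1])
print("momentum final point:", momentum[-1])
```

Under these assumed settings, plain gradient descent multiplies $x_2$ by $-1.4$ at every step, so the iterate flips sign and grows without bound, while $x_1$ shrinks only slowly. With `beta = 0.5` the alternating $x_2$ gradients largely cancel inside `v2`, and both coordinates end close to the minimum at the origin after the same 20 steps.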