Intuition behind Gradient Descent with Momentum

The core intuition behind momentum is that averaging past gradients produces smoother, more effective updates. In a direction where successive gradients keep pointing the same way (such as the $x_1$ direction on an elongated quadratic, where curvature is low), the gradients are well aligned, so their running average preserves a large step in that direction and accelerates progress toward the minimum. Conversely, in a direction where the gradient flips between positive and negative values on consecutive steps (such as the $x_2$ direction on the same elongated quadratic, where curvature is high and each step overshoots), the averaged gradient becomes small because the opposing contributions largely cancel each other out. This selective damping of oscillations and amplification of consistent descent is what allows momentum to navigate ill-conditioned landscapes more efficiently than plain gradient descent.
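The effect can be checked numerically with a minimal sketch. Everything specific here is an assumption chosen for illustration, not something stated above: the objective $f(x_1, x_2) = 0.1 x_1^2 + 2 x_2^2$ as the elongated quadratic, the start point $(-5, -2)$, the learning rate 0.6, the momentum coefficient 0.5, and the function names. The momentum update is taken in the common form $v \leftarrow \beta v + g$, $x \leftarrow x - \eta v$, which is one of several equivalent conventions.

```python
def grad(x1, x2):
    # Gradient of the assumed objective f(x1, x2) = 0.1 * x1**2 + 2 * x2**2.
    # The x1 direction is flat (low curvature), the x2 direction is steep.
    return 0.2 * x1, 4.0 * x2


def gradient_descent(eta, steps, x1=-5.0, x2=-2.0):
    # Plain gradient descent: each step uses only the current gradient.
    for _ in range(steps):
        g1, g2 = grad(x1, x2)
        x1 -= eta * g1
        x2 -= eta * g2
    return x1, x2


def momentum_descent(eta, beta, steps, x1=-5.0, x2=-2.0):
    # Momentum: v accumulates an exponentially weighted sum of past gradients.
    # Aligned gradients (x1) add up; sign-flipping gradients (x2) partly cancel.
    v1 = v2 = 0.0
    for _ in range(steps):
        g1, g2 = grad(x1, x2)
        v1 = beta * v1 + g1
        v2 = beta * v2 + g2
        x1 -= eta * v1
        x2 -= eta * v2
    return x1, x2


if __name__ == "__main__":
    print("plain GD:", gradient_descent(eta=0.6, steps=20))
    print("momentum:", momentum_descent(eta=0.6, beta=0.5, steps=20))
```

With these assumed settings, plain gradient descent multiplies $x_2$ by $1 - 0.6 \cdot 4 = -1.4$ on every step, so the iterate flips sign and grows without bound. With the same learning rate, the momentum variant averages those sign-flipping gradients and both coordinates shrink toward the minimum at the origin.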


Updated 2026-05-15

Tags

Data Science

D2L

Dive into Deep Learning @ D2L