On iteration $$t$$: Compute $$dW$$, $$db$$ on the current mini-batch

$$v_{dW} = \beta v_{dW} + (1-\beta)dW$$

$$v_{db} = \beta v_{db} + (1-\beta)db$$

$$W = W - \alpha v_{dW}, \quad b = b - \alpha v_{db}$$

Note that we now have two hyperparameters: the learning rate $$\alpha$$ and the momentum coefficient $$\beta$$.
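The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `momentum_step`, the scalar parameters, and the toy objective $$(W - 3)^2 + (b + 1)^2$$ are assumptions chosen only to keep the example self-contained.

```python
def momentum_step(w, b, dw, db, v_dw, v_db, alpha=0.01, beta=0.9):
    """Apply one momentum update to a scalar weight w and bias b.

    dw, db are the gradients computed on the current mini-batch;
    v_dw, v_db are the running (exponentially weighted) averages.
    """
    v_dw = beta * v_dw + (1 - beta) * dw
    v_db = beta * v_db + (1 - beta) * db
    w = w - alpha * v_dw
    b = b - alpha * v_db
    return w, b, v_dw, v_db


# Hypothetical toy objective: (w - 3)^2 + (b + 1)^2, minimized at w = 3, b = -1.
w, b = 0.0, 0.0
v_dw, v_db = 0.0, 0.0  # velocities start at zero
for t in range(500):
    dw = 2 * (w - 3)  # gradient with respect to w
    db = 2 * (b + 1)  # gradient with respect to b
    w, b, v_dw, v_db = momentum_step(w, b, dw, db, v_dw, v_db, alpha=0.1, beta=0.9)

print(w, b)
```

In a real network `w`, `b`, and the velocities would be arrays of matching shape, but the arithmetic per element is the same.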

University of Michigan - Ann Arbor

Google

The basic idea of gradient descent with momentum is to compute an exponentially weighted average of your gradients and then use that average, instead of the raw gradient, to update your weights. It almost always works faster than the standard gradient descent algorithm.


Gradient Descent with Momentum

The core intuition behind momentum is that averaging past gradients produces smoother, more effective updates. In a direction where the gradient keeps a consistent sign (such as the $$x_1$$ direction on an elongated quadratic), successive gradients are well-aligned, so their running average preserves a large step in that direction, accelerating progress toward the minimum. Conversely, in a direction where the gradient oscillates between positive and negative values on consecutive steps (such as the $$x_2$$ direction on the same elongated quadratic), the averaged gradient becomes small because the opposing contributions cancel each other out. This selective dampening of oscillations and amplification of consistent descent is what allows momentum to navigate ill-conditioned landscapes far more efficiently than plain gradient descent.

Intuition behind Gradient Descent with Momentum
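The intuition above can be checked numerically on a small elongated quadratic. The sketch below is illustrative only: the objective $$f(x_1, x_2) = 0.1 x_1^2 + 2 x_2^2$$, the starting point, the step size 0.6, and the helper names `grad` and `run` are all assumptions. It also uses the velocity form $$v \leftarrow \beta v + g$$ (without the $$1-\beta$$ factor), so setting $$\beta = 0$$ recovers plain gradient descent.

```python
def grad(x1, x2):
    # Gradient of the assumed objective f(x1, x2) = 0.1 * x1**2 + 2 * x2**2.
    return 0.2 * x1, 4.0 * x2


def run(eta, beta, steps=20, start=(-5.0, -2.0)):
    """Return the list of iterates; beta = 0 gives plain gradient descent."""
    x1, x2 = start
    v1, v2 = 0.0, 0.0
    path = [(x1, x2)]
    for _ in range(steps):
        g1, g2 = grad(x1, x2)
        v1 = beta * v1 + g1  # running sum of past gradients, decayed by beta
        v2 = beta * v2 + g2
        x1, x2 = x1 - eta * v1, x2 - eta * v2
        path.append((x1, x2))
    return path


gd = run(eta=0.6, beta=0.0)   # plain gradient descent
mom = run(eta=0.6, beta=0.5)  # gradient descent with momentum

print("plain GD final point:", gd[-1])
print("momentum final point:", mom[-1])
```

With this step size, plain gradient descent overshoots in the steep $$x_2$$ direction and its oscillation grows, while the momentum run damps the $$x_2$$ oscillation and also ends closer to the minimum along the flat $$x_1$$ direction.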

These plots were generated with three algorithms: gradient descent, gradient descent with momentum ($$\beta = 0.5$$), and gradient descent with momentum ($$\beta = 0.9$$). Which curve corresponds to which algorithm?

Adam stands for adaptive moment estimation.
It combines gradient descent with momentum and RMSProp, bringing the benefits of both: an adaptive learning rate from RMSProp and faster convergence from momentum.

Adam (Deep Learning Optimization Algorithm)
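To make the combination concrete, here is a minimal sketch of one Adam update for a single scalar parameter. It follows the commonly published formulation (a momentum-style first-moment average, an RMSProp-style second-moment average, and bias correction); the function name `adam_step`, the default hyperparameter values, and the toy objective $$\theta^2$$ are assumptions for illustration, not something stated in this text.

```python
import math


def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad         # momentum-style average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # RMSProp-style average of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Hypothetical toy objective: f(theta) = theta^2, with gradient 2 * theta.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, alpha=0.01)

print(theta)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `alpha` regardless of the gradient's scale, which is the adaptive-learning-rate behavior inherited from RMSProp.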

The momentum method, an accelerated gradient method widely used in deep learning optimization, was originally proposed by Boris T. Polyak in 1964.

Origin of the Momentum Method

When implementing the momentum method in optimization, the velocity vector $$\mathbf{v}_t$$ accumulates past gradients to update the parameters. At the beginning of the optimization process, specifically at time $$t=0$$, this velocity vector is conveniently initialized to zero, denoted as $$\mathbf{v}_0 = 0$$.

Velocity Initialization in Momentum Method
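As a rough illustration of what zero initialization implies, the recursion can be unrolled for the first few steps. The sketch assumes the exponentially weighted average form $$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1-\beta)\mathbf{g}_t$$, where $$\mathbf{g}_t$$ is a label introduced here for the gradient at step $$t$$:

$$\mathbf{v}_1 = (1-\beta)\mathbf{g}_1$$

$$\mathbf{v}_2 = (1-\beta)(\beta \mathbf{g}_1 + \mathbf{g}_2)$$

$$\mathbf{v}_t = (1-\beta)\sum_{k=1}^{t} \beta^{t-k}\mathbf{g}_k$$

Because $$\mathbf{v}_0 = 0$$ contributes nothing, the weights on the gradients sum to $$1 - \beta^t$$ rather than 1, so the early velocities are scaled-down averages; the effect fades as $$t$$ grows and $$\beta^t$$ shrinks toward zero.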

To analyze the convergence of the momentum method on a scalar quadratic function $$f(x) = \frac{\lambda}{2} x^2$$, the update equations for both the position $$x$$ and velocity $$v$$ can be written as a coupled linear system: $$\begin{bmatrix} v_{t+1} \\ x_{t+1} \end{bmatrix} = \begin{bmatrix} \beta & \lambda \\ -\eta \beta & (1 - \eta \lambda) \end{bmatrix} \begin{bmatrix} v_{t} \\ x_{t} \end{bmatrix}$$. The convergence behavior is governed entirely by the eigenvalues of this $$2 \times 2$$ transition matrix. Analysis of this matrix shows that the velocity converges when the hyperparameters satisfy $$0 < \eta \lambda < 2 + 2 \beta$$. This feasible range is larger than the $$0 < \eta \lambda < 2$$ constraint required for standard gradient descent, and it approaches twice that width as $$\beta$$ approaches 1, confirming that larger momentum coefficients ($$\beta$$) permit larger learning rates without divergence.
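The stated condition can be spot-checked numerically by computing the eigenvalues of the transition matrix for a few hyperparameter settings. The sketch below is a sanity check under assumed values, not a proof: the helper name `spectral_radius`, the choice $$\lambda = 4$$, and the sample $$(\beta, \eta)$$ pairs are all illustrative. The iteration is stable when the largest eigenvalue magnitude is below 1.

```python
import cmath


def spectral_radius(beta, eta, lam):
    """Largest eigenvalue magnitude of [[beta, lam], [-eta*beta, 1 - eta*lam]]."""
    trace = beta + 1.0 - eta * lam
    det = beta * (1.0 - eta * lam) + eta * beta * lam  # simplifies to beta
    # Eigenvalues are the roots of z^2 - trace * z + det = 0 (possibly complex).
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return max(abs((trace + root) / 2.0), abs((trace - root) / 2.0))


lam = 4.0  # assumed curvature of the scalar quadratic
for beta, eta in [(0.0, 0.4), (0.0, 0.6), (0.5, 0.6), (0.5, 0.8)]:
    rho = spectral_radius(beta, eta, lam)
    inside = 0.0 < eta * lam < 2.0 + 2.0 * beta
    print(f"beta={beta}, eta*lambda={eta * lam:.1f}, "
          f"radius={rho:.3f}, condition holds={inside}")
```

For these samples the spectral radius is below 1 exactly when $$0 < \eta \lambda < 2 + 2\beta$$ holds; in particular, $$\eta \lambda = 2.4$$ is unstable without momentum but stable with $$\beta = 0.5$$.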
