1Cademy - Effect of an Excessive Learning Rate on Gradient Descent

Learn Before

Learning Rate

Concept

Effect of an Excessive Learning Rate on Gradient Descent

When the learning rate $\eta$ is set too high, the gradient step $|\eta f'(x)|$ can become large enough to invalidate the first-order Taylor expansion used to justify gradient descent, $f(x - \eta f'(x)) = f(x) - \eta (f'(x))^2 + \mathcal{O}(\eta^2 (f'(x))^2)$. Specifically, the higher-order remainder term $\mathcal{O}(\eta^2 (f'(x))^2)$ is no longer negligible compared with the first-order decrease $\eta (f'(x))^2$, so there is no guarantee that the function value will decrease after each update. In practice, the iterates overshoot the optimal solution and can diverge. For example, for $f(x) = x^2$ with $\eta = 1.1$, each update is $x \leftarrow x - \eta \cdot 2x = (1 - 2\eta)x = -1.2x$. Starting from $x = 10$, the iterate therefore jumps across the minimum at $x = 0$ on every step while growing by 20% in magnitude, reaching approximately 61.92 after $10$ iterations instead of converging.
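The divergence can be checked numerically. The following is a minimal sketch, not a reference implementation: the helper name `gradient_descent` and the comparison rate `eta=0.2` are illustrative assumptions, and the starting point `x0=10.0` is the value for which ten updates with $\eta = 1.1$ give roughly 61.92, since every update multiplies $x$ by $1 - 2\eta = -1.2$.

```python
def f(x):
    """Objective function f(x) = x^2, minimized at x = 0."""
    return x ** 2


def f_grad(x):
    """Derivative f'(x) = 2x."""
    return 2 * x


def gradient_descent(grad, x0, eta, steps):
    """Run plain gradient descent; return all iterates, including x0.

    Hypothetical helper written for this illustration.
    """
    xs = [x0]
    x = x0
    for _ in range(steps):
        x = x - eta * grad(x)  # update rule: x <- x - eta * f'(x)
        xs.append(x)
    return xs


# Excessive learning rate: each step multiplies x by (1 - 2 * 1.1) = -1.2.
diverging = gradient_descent(f_grad, x0=10.0, eta=1.1, steps=10)

# Smaller rate for comparison (assumed value): each step multiplies x by 0.6.
converging = gradient_descent(f_grad, x0=10.0, eta=0.2, steps=10)

for k, (xd, xc) in enumerate(zip(diverging, converging)):
    print(f"step {k:2d}: eta=1.1 -> x = {xd: 10.4f}   eta=0.2 -> x = {xc: 8.4f}")
```

With $\eta = 1.1$ the sign of $x$ alternates and $f(x)$ increases on every step, whereas with the smaller rate the iterates shrink toward $0$.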

Updated 2026-05-15

References

Dive into Deep Learning

Learn After

Adaptive Optimization Methods
