Learn Before
Concept

Effect of an Excessive Learning Rate on Gradient Descent

When the learning rate η\eta is set too high, the magnitude of the gradient step ηf(x)|\eta f'(x)| can become large enough to invalidate the first-order Taylor expansion used to justify gradient descent. Specifically, the higher-order remainder term O(η2(f(x))2)\mathcal{O}(\eta^2 (f'(x))^2) is no longer negligible, so there is no guarantee that the function value will decrease after each update. In practice, the iterates overshoot the optimal solution and can diverge. For example, applying gradient descent to f(x)=x2f(x) = x^2 with η=1.1\eta = 1.1 starting from x=5x = 5 causes xx to overshoot the minimum at x=0x = 0 and gradually diverge, reaching approximately 61.9261.92 after 1010 iterations instead of converging.

Image 0

0

1

Updated 2026-05-15

Contributors are:

Who are from:

Tags

D2L

Dive into Deep Learning @ D2L

Learn After