Learn Before
Concept
Effect of an Excessive Learning Rate on Gradient Descent
When the learning rate is set too high, the magnitude of the gradient step can become large enough to invalidate the first-order Taylor expansion used to justify gradient descent. Specifically, the higher-order remainder term is no longer negligible, so there is no guarantee that the function value will decrease after each update. In practice, the iterates overshoot the optimal solution and can diverge. For example, applying gradient descent to with starting from causes to overshoot the minimum at and gradually diverge, reaching approximately after iterations instead of converging.
0
1
Updated 2026-05-15
Tags
D2L
Dive into Deep Learning @ D2L