Learn Before
Example

Effect of a Small Learning Rate on Gradient Descent

When the learning rate η\eta is chosen to be too small, each gradient descent update moves the parameter xx only a tiny distance toward the optimum. This results in extremely slow progress, with the algorithm requiring a large number of iterations to reach a satisfactory solution. For instance, applying gradient descent to the quadratic f(x)=x2f(x) = x^2 with η=0.05\eta = 0.05 and starting from x=10x = 10, the parameter value is still approximately 3.493.49 after 1010 iterations—far from the optimal solution at x=0x = 0. While a small learning rate ensures that the first-order Taylor approximation remains valid and the function value decreases at every step, the practical cost is an unacceptably slow convergence rate.

Image 0

0

1

Updated 2026-05-15

Contributors are:

Who are from:

Tags

D2L

Dive into Deep Learning @ D2L

Learn After