When the learning rate $$\eta$$ is too small, each gradient descent update moves the parameter $$x$$ only a tiny distance toward the optimum, so the algorithm needs many iterations to reach a satisfactory solution. For instance, applying gradient descent to the quadratic $$f(x) = x^2$$ with $$\eta = 0.05$$ from the starting point $$x = 10$$ shrinks $$x$$ by a factor of only $$1 - 2\eta = 0.9$$ per update, so after $$10$$ iterations the parameter is still approximately $$3.49$$, far from the optimal solution at $$x = 0$$. A small learning rate keeps the first-order Taylor approximation valid and makes the function value decrease at every step, but the practical cost is slow convergence.

Effect of a Small Learning Rate on Gradient Descent
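The slow-progress example can be checked numerically. The following is a minimal Python sketch rather than a reference implementation: the helper names `gradient_descent_quadratic` and `iterations_to_reach`, and the threshold of `0.01`, are assumptions introduced for illustration, and $$f(x) = x^2$$ with $$f'(x) = 2x$$ is hard-coded.

```python
def gradient_descent_quadratic(x0, eta, num_iters):
    """Run gradient descent on f(x) = x**2 and return every iterate.

    Hypothetical helper for illustration; f'(x) = 2x is hard-coded.
    """
    xs = [x0]
    x = x0
    for _ in range(num_iters):
        x = x - eta * 2 * x  # update rule: x <- x - eta * f'(x)
        xs.append(x)
    return xs


def iterations_to_reach(x0, eta, threshold, max_iters=10_000):
    """Count the updates needed until |x| < threshold (capped at max_iters)."""
    x = x0
    for count in range(max_iters):
        if abs(x) < threshold:
            return count
        x = x - eta * 2 * x
    return max_iters


small_lr_path = gradient_descent_quadratic(x0=10.0, eta=0.05, num_iters=10)
print(f"x after 10 iterations with eta=0.05: {small_lr_path[-1]:.2f}")

# Compare how long two learning rates take to get within 0.01 of the optimum.
slow_steps = iterations_to_reach(x0=10.0, eta=0.05, threshold=0.01)
fast_steps = iterations_to_reach(x0=10.0, eta=0.2, threshold=0.01)
print(f"updates needed: eta=0.05 -> {slow_steps}, eta=0.2 -> {fast_steps}")
```

The second half of the sketch makes the cost concrete by counting updates instead of inspecting a single iterate; the comparison rate of `0.2` is an arbitrary larger value chosen for contrast.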

When the learning rate $$\eta$$ is set too high, the magnitude of the gradient step $$|\eta f'(x)|$$ can become large enough to invalidate the first-order Taylor expansion used to justify gradient descent. Specifically, the higher-order remainder term $$\mathcal{O}(\eta^2 (f'(x))^2)$$ is no longer negligible, so there is no guarantee that the function value will decrease after each update. In practice, the iterates overshoot the optimal solution and can diverge. For example, applying gradient descent to $$f(x) = x^2$$ with $$\eta = 1.1$$ starting from $$x = 10$$ multiplies $$x$$ by $$1 - 2\eta = -1.2$$ at every update, so the iterates jump back and forth across the minimum at $$x = 0$$ with growing magnitude, reaching approximately $$61.92$$ after $$10$$ iterations instead of converging.

Effect of an Excessive Learning Rate on Gradient Descent
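The divergent behaviour for $$\eta = 1.1$$ on $$f(x) = x^2$$ can be reproduced with a short Python sketch. The start value `x = 10.0` is an assumption chosen to match the other examples in this section; because each update multiplies $$x$$ by $$1 - 2\eta = -1.2$$, any nonzero start value grows in magnitude by the same factor $$1.2^{10} \approx 6.19$$ over $$10$$ iterations.

```python
def f(x):
    return x ** 2


def f_grad(x):
    return 2 * x


eta = 1.1   # learning rate that is too large for this objective
x = 10.0    # assumed start value (illustrative)
values = [f(x)]
signs_alternate = True

for _ in range(10):
    x_new = x - eta * f_grad(x)  # each step multiplies x by (1 - 2 * eta) = -1.2
    if x_new * x >= 0:           # a sign flip means the step jumped over x = 0
        signs_alternate = False
    x = x_new
    values.append(f(x))

objective_always_grows = all(b > a for a, b in zip(values, values[1:]))

print(f"x after 10 iterations: {x:.2f}")
print(f"every step overshoots the minimum: {signs_alternate}")
print(f"f(x) increases at every step: {objective_always_grows}")
```

Tracking the sign flips and the objective values separately shows both symptoms named in the text: the overshoot past the minimum and the failure of the function value to decrease.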

A learning rate scheduler is a mechanism used to dynamically adjust the learning rate during the training of a model. Rather than keeping the learning rate constant, the scheduler acts as a function that takes the number of optimization updates (such as iterations or epochs) and outputs the appropriate learning rate value for the next step.

Learning Rate Scheduler
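As a rough illustration of the "function of the update count" view, the sketch below builds a scheduler and queries it before each gradient descent step on $$f(x) = x^2$$. Exponential decay is only one possible schedule and is an assumption here, as are the name `exponential_decay_scheduler` and the values `initial_lr=0.3` and `decay_rate=0.1`; the text does not prescribe a particular schedule.

```python
import math


def exponential_decay_scheduler(initial_lr, decay_rate):
    """Return a function mapping the number of updates to a learning rate.

    Hypothetical schedule for illustration:
        lr(t) = initial_lr * exp(-decay_rate * t)
    """
    def schedule(num_updates):
        return initial_lr * math.exp(-decay_rate * num_updates)

    return schedule


scheduler = exponential_decay_scheduler(initial_lr=0.3, decay_rate=0.1)

x = 10.0
learning_rates = []
for t in range(10):
    lr = scheduler(t)      # ask the scheduler for the rate at update count t
    learning_rates.append(lr)
    x = x - lr * 2 * x     # gradient descent step on f(x) = x**2

print(f"learning rate at t=0: {learning_rates[0]:.3f}")
print(f"learning rate at t=9: {learning_rates[-1]:.3f}")
print(f"x after 10 updates: {x:.4f}")
```

Keeping the scheduler as a plain function of the update count means the training loop does not need to know which schedule is in use.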

When training advanced neural network architectures, careful parameter initialization alone is sometimes insufficient to guarantee stable optimization. This creates a dilemma when choosing the initial learning rate: a sufficiently small value prevents early divergence but makes progress extremely slow, whereas a large value leads to immediate divergence.

Dilemma of Initial Learning Rate

The learning rate $$\eta$$ is a positive scalar chosen by the algorithm designer that controls the size of each parameter update step in gradient descent. It directly scales the gradient to determine how far the parameters move in the negative gradient direction at each iteration. Setting $$\eta$$ appropriately is critical: a value that is too small results in very slow updates, requiring many more iterations to approach the optimum, while a value that is too large can cause the update step $$\eta f'(x)$$ to become so large that the first-order Taylor approximation breaks down, potentially causing the iterates to overshoot the minimum and diverge rather than converge.
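Both failure modes can be seen in a short worked equation. Taking the quadratic $$f(x) = x^2$$ with $$f'(x) = 2x$$ as an illustrative objective (an assumption made here for concreteness), each update rescales the iterate by a constant factor:

$$
x_{t+1} = x_t - \eta \cdot 2 x_t = (1 - 2\eta)\, x_t
\quad\Longrightarrow\quad
x_t = (1 - 2\eta)^t \, x_0
$$

For this particular function the iterates shrink toward $$0$$ exactly when $$|1 - 2\eta| < 1$$, that is, when $$0 < \eta < 1$$. A value of $$\eta$$ close to $$0$$ gives a factor close to $$1$$, so progress is slow, while $$\eta > 1$$ gives a factor below $$-1$$, so the iterates flip sign and grow in magnitude at every step. The threshold $$\eta < 1$$ is specific to this quadratic and should not be read as a general rule.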

Claude

University of Michigan - Ann Arbor

One-dimensional gradient descent provides a clear illustration of why moving in the negative gradient direction reduces the objective function. For a continuously differentiable function $$f: \mathbb{R} \rightarrow \mathbb{R}$$, the first-order Taylor expansion gives $$f(x + \epsilon) = f(x) + \epsilon f'(x) + \mathcal{O}(\epsilon^2)$$. Setting the step as $$\epsilon = -\eta f'(x)$$, where $$\eta > 0$$ is a fixed learning rate, yields $$f(x - \eta f'(x)) = f(x) - \eta f'^2(x) + \mathcal{O}(\eta^2 f'^2(x))$$. When the derivative $$f'(x) \neq 0$$, the term $$\eta f'^2(x) > 0$$ guarantees a decrease in $$f$$, provided $$\eta$$ is small enough for the higher-order terms to be negligible. This leads to the update rule $$x \leftarrow x - \eta f'(x)$$, which is applied iteratively from an initial value until a stopping condition is met, such as when the gradient magnitude $$|f'(x)|$$ becomes sufficiently small or a maximum number of iterations is reached.

One-Dimensional Gradient Descent
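The update rule and the two stopping conditions described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `gradient_descent_1d`, the defaults `tol=1e-6` and `max_iters=1000`, and the test objective $$f(x) = (x - 3)^2$$ are all hypothetical choices.

```python
def gradient_descent_1d(f_grad, x0, eta, tol=1e-6, max_iters=1000):
    """Minimize a one-dimensional function given its derivative f_grad.

    Stops when |f'(x)| < tol or after max_iters updates.
    Returns the final x and the number of updates performed.
    Names and default values are illustrative assumptions.
    """
    x = x0
    for num_updates in range(max_iters):
        grad = f_grad(x)
        if abs(grad) < tol:   # stopping condition: gradient magnitude is small
            return x, num_updates
        x = x - eta * grad    # update rule: x <- x - eta * f'(x)
    return x, max_iters       # stopping condition: iteration budget exhausted


# Illustrative objective: f(x) = (x - 3)**2, with derivative 2 * (x - 3)
# and its minimum at x = 3.
x_min, steps = gradient_descent_1d(lambda x: 2 * (x - 3), x0=0.0, eta=0.1)
print(f"x = {x_min:.6f} after {steps} updates")
```

Returning the update count alongside the final value makes it easy to tell which stopping condition fired: a count equal to `max_iters` means the gradient never fell below the tolerance.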

Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Retrieved from [www.deeplearningbook.org](https://www.deeplearningbook.org)

Deep Learning

Dive into Deep Learning

To illustrate one-dimensional gradient descent concretely, consider minimizing the objective function $$f(x) = x^2$$, whose derivative is $$f'(x) = 2x$$. Although the minimum at $$x = 0$$ is known analytically, applying gradient descent with an initial value of $$x = 10$$ and a learning rate of $$\eta = 0.2$$ demonstrates how the iterative update $$x \leftarrow x - \eta \cdot 2x$$ drives $$x$$ toward the optimum. After $$10$$ iterations, $$x$$ reaches approximately $$0.0605$$, confirming that the algorithm steadily reduces the function value and converges close to the true minimum.
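The run described above can be traced step by step with a few lines of Python. This is a small sketch that hard-codes $$f(x) = x^2$$ and $$f'(x) = 2x$$; the variable names are illustrative.

```python
eta = 0.2   # learning rate from the example
x = 10.0    # initial value from the example
f_values = [x ** 2]

for i in range(1, 11):
    x = x - eta * 2 * x   # x <- x - eta * f'(x), with f'(x) = 2x
    f_values.append(x ** 2)
    print(f"iteration {i:2d}: x = {x:.4f}, f(x) = {x ** 2:.6f}")

# Check that the objective decreased at every single update.
monotone_decrease = all(b < a for a, b in zip(f_values, f_values[1:]))
print(f"f(x) decreased at every step: {monotone_decrease}")
```

Each update multiplies $$x$$ by $$1 - 2\eta = 0.6$$, so the final iterate equals $$10 \cdot 0.6^{10} \approx 0.0605$$, in agreement with the value quoted above.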
