When gradient descent is applied to the scalar quadratic $$f(x) = \frac{\lambda}{2} x^2$$, the update rule simplifies to $$x_{t+1} = x_t - \eta \lambda x_t = (1 - \eta \lambda) x_t$$, where $$\eta$$ is the learning rate and $$\lambda > 0$$ is the curvature. After $$t$$ iterations, the iterate is given in closed form by $$x_t = (1 - \eta \lambda)^t x_0$$. The iterates therefore converge exponentially to the minimum at $$x=0$$ provided that $$|1 - \eta \lambda| < 1$$, that is, $$0 < \eta \lambda < 2$$. The contraction factor $$|1 - \eta \lambda|$$ shrinks as $$\eta$$ increases up to $$\eta \lambda = 1$$, where a single step reaches the minimum. For $$1 < \eta \lambda < 2$$ the iterates still converge but alternate in sign, and if the learning rate is so large that $$\eta \lambda > 2$$, the sequence diverges.
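The closed form can be checked numerically against the step-by-step update. The following is a minimal sketch; the function names and the values of `x0`, `eta`, and `lam` are illustrative assumptions, not taken from the text.

```python
def gd_scalar_quadratic(x0, eta, lam, steps):
    """Iterate x_{t+1} = x_t - eta * lam * x_t and return every iterate."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - eta * lam * xs[-1])
    return xs


def closed_form(x0, eta, lam, t):
    """Closed-form iterate x_t = (1 - eta * lam)^t * x_0."""
    return (1 - eta * lam) ** t * x0


# Illustrative values (assumed): eta * lam = 0.5, inside the convergent range.
x0, eta, lam = 1.0, 0.1, 5.0
xs = gd_scalar_quadratic(x0, eta, lam, steps=20)

# Largest difference between the iterated and closed-form values.
max_gap = max(abs(x - closed_form(x0, eta, lam, t)) for t, x in enumerate(xs))
print(f"largest gap between iteration and closed form: {max_gap:.2e}")
```

The two agree up to floating-point rounding, which is what the derivation predicts.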

Gradient Descent Convergence on a Scalar Quadratic

The convergence dynamics of gradient descent on a scalar quadratic function $$f(x) = \frac{\lambda}{2} x^2$$ can be visualized programmatically by plotting the decay term $$(1 - \eta \lambda)^t$$ over successive time steps $$t$$. For instance, fixing the learning rate at $$\eta = 0.1$$ and iterating over a set of curvature values (such as $$\lambda \in \{0.1, 1, 10, 19\}$$, giving $$\eta \lambda \in \{0.01, 0.1, 1, 1.9\}$$) produces distinct trajectory curves. These plots show that values satisfying $$0 < \eta \lambda \le 1$$ yield smooth exponential decay, while values with $$1 < \eta \lambda < 2$$ oscillate in sign but still converge. Values close to the limit $$\eta \lambda = 2$$ converge very slowly because $$|1 - \eta \lambda|$$ is close to $$1$$, and divergence sets in only once $$\eta \lambda$$ exceeds $$2$$.
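As a rough sketch of the computation behind such a plot, the decay term can be tabulated for each curvature value. The helper names `decay_series` and `regime` are hypothetical, and the plotting call is left out so the example depends only on the standard library; each series could be passed to any plotting tool.

```python
eta = 0.1
lambdas = [0.1, 1, 10, 19]
steps = 20


def decay_series(eta, lam, steps):
    """Return [(1 - eta * lam)^t for t = 0..steps]."""
    factor = 1 - eta * lam
    return [factor ** t for t in range(steps + 1)]


def regime(eta, lam):
    """Classify the behaviour from the product eta * lam."""
    p = eta * lam
    if 0 < p <= 1:
        return "monotone decay"
    if 1 < p < 2:
        return "oscillating, converging"
    return "not converging"


series = {lam: decay_series(eta, lam, steps) for lam in lambdas}
for lam in lambdas:
    s = series[lam]
    print(f"lambda={lam:>4}: {regime(eta, lam):<24} "
          f"t=1: {s[1]: .4f}  t={steps}: {s[-1]: .4f}")
```

With these values, $$\lambda = 10$$ gives a decay factor of zero, so the term vanishes after one step, while $$\lambda = 19$$ gives a factor near $$-0.9$$, so the term flips sign each step and shrinks slowly.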

Scalar Quadratic Gradient Descent Convergence Visualization

To illustrate one-dimensional gradient descent concretely, consider minimizing the objective function $$f(x) = x^2$$, whose derivative is $$f'(x) = 2x$$. Although the minimum at $$x = 0$$ is known analytically, applying gradient descent with an initial value of $$x = 10$$ and a learning rate of $$\eta = 0.2$$ demonstrates how the iterative update $$x \leftarrow x - \eta \cdot 2x$$ drives $$x$$ toward the optimum. After $$10$$ iterations, $$x$$ reaches approximately $$0.0605$$, confirming that the algorithm steadily reduces the function value and converges close to the true minimum.
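The run described above can be reproduced in a few lines. This is a minimal sketch using the stated starting point, learning rate, and iteration count; the variable names are assumptions.

```python
def f(x):
    return x ** 2


def f_grad(x):
    return 2 * x


eta = 0.2
x = 10.0
trajectory = [x]
for _ in range(10):
    x = x - eta * f_grad(x)  # each step multiplies x by (1 - 2 * eta) = 0.6
    trajectory.append(x)

print(f"x after 10 iterations: {x:.4f}")  # about 0.0605, matching the text
print(f"f(x) after 10 iterations: {f(x):.6f}")
```

Because every step scales $$x$$ by $$0.6$$, the final value equals $$10 \cdot 0.6^{10}$$, and the function value decreases at every iteration.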

Claude

One-dimensional gradient descent provides a clear illustration of why moving in the negative gradient direction reduces the objective function. For a continuously differentiable function $$f: \mathbb{R} \rightarrow \mathbb{R}$$, the first-order Taylor expansion gives $$f(x + \epsilon) = f(x) + \epsilon f'(x) + \mathcal{O}(\epsilon^2)$$. Setting the step as $$\epsilon = -\eta f'(x)$$, where $$\eta > 0$$ is a fixed learning rate, yields $$f(x - \eta f'(x)) = f(x) - \eta f'^2(x) + \mathcal{O}(\eta^2 f'^2(x))$$. When the derivative $$f'(x) \neq 0$$, the term $$\eta f'^2(x) > 0$$ guarantees a decrease in $$f$$, provided $$\eta$$ is small enough for the higher-order terms to be negligible. This leads to the update rule $$x \leftarrow x - \eta f'(x)$$, which is applied iteratively from an initial value until a stopping condition is met, such as when the gradient magnitude $$|f'(x)|$$ becomes sufficiently small or a maximum number of iterations is reached.
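The update rule and both stopping conditions can be sketched as a small routine. The function name `gradient_descent_1d`, its parameters, and the test objective $$f(x) = (x - 3)^2$$ are illustrative assumptions rather than anything prescribed by the text.

```python
def gradient_descent_1d(grad, x0, eta, tol=1e-8, max_iter=1000):
    """Apply x <- x - eta * grad(x) until |grad(x)| < tol or max_iter is hit.

    Returns the final x and the number of updates performed.
    """
    x = x0
    for i in range(max_iter):
        g = grad(x)
        if abs(g) < tol:  # gradient is small enough: stop early
            return x, i
        x = x - eta * g
    return x, max_iter


# Hypothetical objective f(x) = (x - 3)^2 with derivative 2 * (x - 3).
x_min, n_updates = gradient_descent_1d(lambda x: 2 * (x - 3), x0=0.0, eta=0.1)
print(f"x = {x_min:.6f} after {n_updates} updates")
```

Checking the gradient before each update means the routine performs no work when it starts at a stationary point, and the iteration cap bounds the cost when the learning rate is poorly chosen.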

One-Dimensional Gradient Descent

Dive into Deep Learning

One-Dimensional Gradient Descent on a Quadratic

The learning rate $$\eta$$ is a positive scalar chosen by the algorithm designer that controls the size of each parameter update step in gradient descent. It directly scales the gradient to determine how far the parameters move in the negative gradient direction at each iteration. Setting $$\eta$$ appropriately is critical: a value that is too small results in very slow updates, requiring many more iterations to approach the optimum, while a value that is too large can cause the update step $$\eta f'(x)$$ to become so large that the first-order Taylor approximation breaks down, potentially causing the iterates to overshoot the minimum and diverge rather than converge.
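The three regimes can be compared on a simple objective. The sketch below assumes $$f(x) = x^2$$, a start at $$x = 10$$, ten steps, and three learning rates chosen purely for illustration.

```python
def run_gd(eta, x0=10.0, steps=10):
    """Run gradient descent on f(x) = x^2 (derivative 2x) and return the final x."""
    x = x0
    for _ in range(steps):
        x = x - eta * 2 * x
    return x


# Assumed learning rates: too small, reasonable, too large.
results = {eta: run_gd(eta) for eta in (0.01, 0.2, 1.1)}
for eta, x in results.items():
    print(f"eta={eta}: x after 10 steps = {x:.4f}")
```

With $$\eta = 0.01$$ the iterate has barely moved from its starting point, with $$\eta = 0.2$$ it is close to the minimum, and with $$\eta = 1.1$$ each step overshoots by more than the previous distance, so the magnitude grows.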
