Code

AdaGrad Optimization Trajectory in 2D

To observe the behavior of AdaGrad in a quadratic convex problem, we can apply it to the two-dimensional function $f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$. In Python, the coordinate-wise update can be expressed as:

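import math

def adagrad_2d(x1, x2, s1, s2, eta):
    eps = 1e-6  # small constant for numerical stability
    # Gradients of f(x) = 0.1 * x1^2 + 2 * x2^2
    g1, g2 = 0.2 * x1, 4 * x2
    # Accumulate the squared gradients per coordinate
    s1 += g1 ** 2
    s2 += g2 ** 2
    # Scale each coordinate's step by its accumulated gradient history
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2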
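
For reference, the update performed by `adagrad_2d` can be written out per coordinate. This is a sketch read directly off the code above; the index notation ($t$ for the step, $i \in \{1, 2\}$ for the coordinate) is introduced here for illustration, and $\epsilon = 10^{-6}$ as in the code:

$$
\begin{aligned}
g_{t,1} &= 0.2\, x_{t-1,1}, \qquad g_{t,2} = 4\, x_{t-1,2}, \\
s_{t,i} &= s_{t-1,i} + g_{t,i}^2, \\
x_{t,i} &= x_{t-1,i} - \frac{\eta}{\sqrt{s_{t,i} + \epsilon}}\, g_{t,i}.
\end{aligned}
$$

Because $s_{t,i}$ only ever grows, the effective per-coordinate learning rate $\eta / \sqrt{s_{t,i} + \epsilon}$ can never increase from one step to the next.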

When optimized with a standard learning rate (e.g., $\eta = 0.4$), the trajectory is initially smooth, but the variables nearly stop moving in later iterations because the accumulated state $\mathbf{s}_t$ keeps shrinking the effective learning rate. Increasing the learning rate to a much larger value (e.g., $\eta = 2$) yields better convergence, which shows that AdaGrad's learning rate decay can be quite aggressive and that the initial learning rate may need careful tuning.
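
The comparison between the two learning rates can be reproduced with a short loop. The following is a minimal sketch rather than a definitive experiment: the starting point `(-5, -2)`, the 20 iterations, and the helper names `f_2d` and `run_adagrad` are assumptions chosen for illustration and are not specified in the text above.

```python
import math


def f_2d(x1, x2):
    """Objective f(x) = 0.1 * x1^2 + 2 * x2^2."""
    return 0.1 * x1 ** 2 + 2 * x2 ** 2


def adagrad_2d(x1, x2, s1, s2, eta, eps=1e-6):
    """One AdaGrad step on f_2d, same update as in the text."""
    g1, g2 = 0.2 * x1, 4 * x2
    s1 += g1 ** 2
    s2 += g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2


def run_adagrad(eta, steps=20, x1=-5.0, x2=-2.0):
    """Hypothetical helper: iterate AdaGrad and return the visited points.

    The default start point and step count are illustrative assumptions.
    """
    s1 = s2 = 0.0
    trajectory = [(x1, x2)]
    for _ in range(steps):
        x1, x2, s1, s2 = adagrad_2d(x1, x2, s1, s2, eta)
        trajectory.append((x1, x2))
    return trajectory


for eta in (0.4, 2.0):
    x1, x2 = run_adagrad(eta)[-1]
    print(f"eta={eta}: x1={x1:.6f}, x2={x2:.6f}, f={f_2d(x1, x2):.6f}")
```

Under these assumptions, the run with the smaller learning rate ends noticeably farther from the minimum at the origin than the run with $\eta = 2$, most visibly in the flat $x_1$ direction where the gradients are small.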


Updated 2026-05-15

Tags

Deep Learning (in Machine learning)

Data Science

D2L

Dive into Deep Learning @ D2L

Computing Sciences