AdaGrad Optimization Trajectory in 2D
To observe how AdaGrad behaves on a convex quadratic problem, we can apply it to the two-dimensional function $f(\mathbf{x}) = 0.1x_1^2 + 2x_2^2$, whose gradient is $(0.2x_1, 4x_2)$. In Python, the coordinate-wise update can be expressed as:
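import math

def adagrad_2d(x1, x2, s1, s2, eta):
    eps = 1e-6  # small constant for numerical stability
    # Gradient of the objective with respect to x1 and x2
    g1, g2 = 0.2 * x1, 4 * x2
    # Accumulate the squared gradients in the state variables
    s1 += g1 ** 2
    s2 += g2 ** 2
    # Scale each coordinate's step by its own accumulated state
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2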
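For reference, the update that `adagrad_2d` performs can be written out per coordinate. This is a transcription of the code above rather than an independent derivation; the time index $t$ and the coordinate index $i \in \{1, 2\}$ are notation introduced here for illustration, with $g_{t,1} = 0.2\,x_{t-1,1}$ and $g_{t,2} = 4\,x_{t-1,2}$:

$$
\begin{aligned}
s_{t,i} &= s_{t-1,i} + g_{t,i}^2, \\
x_{t,i} &= x_{t-1,i} - \frac{\eta}{\sqrt{s_{t,i} + \epsilon}}\, g_{t,i}.
\end{aligned}
$$

Because $s_{t,i}$ can only grow, the effective step size $\eta / \sqrt{s_{t,i} + \epsilon}$ can only shrink over time.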
With a moderate learning rate, the trajectory is smooth at first, but the variables stop moving early: the state variables only accumulate squared gradients and never shrink, so the effective learning rate keeps decaying. Increasing the initial learning rate to a much larger value yields considerably better convergence. This shows that AdaGrad's learning rate decay can be quite aggressive and that the initial learning rate needs to be chosen with care.
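The comparison described above can be reproduced with a short driver loop. The following is a minimal sketch, not a definitive experiment: the helper names `f_2d` and `run_adagrad`, the starting point `(-5, -2)`, the 20 steps, and the two learning rates `0.4` and `2.0` are all assumptions chosen for illustration.

```python
import math


def f_2d(x1, x2):
    # Objective implied by the gradients (0.2 * x1, 4 * x2)
    return 0.1 * x1 ** 2 + 2 * x2 ** 2


def adagrad_2d(x1, x2, s1, s2, eta):
    eps = 1e-6  # small constant for numerical stability
    g1, g2 = 0.2 * x1, 4 * x2
    s1 += g1 ** 2
    s2 += g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2


def run_adagrad(eta, steps=20, x1=-5.0, x2=-2.0):
    # Hypothetical driver: start point and step count are illustrative
    s1, s2 = 0.0, 0.0
    trajectory = [(x1, x2)]
    for _ in range(steps):
        x1, x2, s1, s2 = adagrad_2d(x1, x2, s1, s2, eta)
        trajectory.append((x1, x2))
    return trajectory


if __name__ == "__main__":
    # Assumed learning rates: one moderate, one much larger
    for eta in (0.4, 2.0):
        x1, x2 = run_adagrad(eta)[-1]
        print(f"eta={eta}: x1={x1:.6f}, x2={x2:.6f}, f={f_2d(x1, x2):.6f}")
```

Under these assumptions, the run with the larger learning rate should end noticeably closer to the minimum at the origin than the run with the smaller one, which is consistent with the behavior described in the text.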