AdaGrad's Proxy for the Hessian Diagonal
In deep learning, exact diagonal preconditioning is usually infeasible: for a parameter vector with d entries, computing the second derivative (the Hessian matrix) takes O(d^2) space and computation, even on a single minibatch. AdaGrad sidesteps this cost by using the magnitude of the stochastic gradients themselves, accumulated per coordinate as a running sum of squared gradients. Because stochastic gradients have nonzero variance even at optimality, this accumulated magnitude serves as a cheap proxy for the scale of the Hessian diagonal: coordinates that consistently see large gradients are treated as steep and receive smaller effective step sizes.
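The per-coordinate scaling described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`adagrad_step`, `run_adagrad`, `loss`), the ill-conditioned quadratic test objective, the learning rate, and the placement of `eps` outside the square root are all choices made here for the example. The gradient is deterministic to keep the sketch short; in practice it would be a minibatch estimate.

```python
import numpy as np


def adagrad_step(x, grad, s, lr=0.4, eps=1e-6):
    """One AdaGrad update (illustrative helper, not a library API).

    s accumulates squared gradients per coordinate; its square root
    stands in for the scale of the Hessian diagonal.
    """
    s = s + grad ** 2
    x = x - lr * grad / (np.sqrt(s) + eps)
    return x, s


def loss(x, h):
    # Assumed test objective: f(x) = 0.5 * sum_i h_i * x_i^2,
    # whose Hessian is diagonal with entries h.
    return 0.5 * np.sum(h * x ** 2)


def run_adagrad(h, x0, steps=100, lr=0.4):
    x = np.array(x0, dtype=float)
    s = np.zeros_like(x)
    for _ in range(steps):
        grad = h * x  # exact gradient of the quadratic above
        x, s = adagrad_step(x, grad, s, lr=lr)
    return x, s


# Badly conditioned example: curvature 0.2 in one direction, 4.0 in the other.
h = np.array([0.2, 4.0])
x0 = np.array([-5.0, -2.0])
x_final, s_final = run_adagrad(h, x0)

print("final x:", x_final)
print("accumulated squared gradients:", s_final)
```

The accumulator `s` ends up larger for the high-curvature coordinate, so that coordinate's effective step size shrinks more, which is the behavior a true diagonal preconditioner would provide. No Hessian entry is ever computed; the only extra state is one vector the same size as the parameters.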
Updated 2026-05-15
Tags
D2L
Dive into Deep Learning @ D2L