Concept

AdaGrad's Proxy for the Hessian Diagonal

In deep learning, applying exact diagonal preconditioning is typically infeasible: for a parameter vector $\mathbf{x} \in \mathbb{R}^d$, computing the second derivatives (the Hessian matrix) takes $\mathcal{O}(d^2)$ space and computation. AdaGrad sidesteps this cost by using the magnitude of the stochastic gradients themselves, accumulated per coordinate as a running sum of squared gradients. Because stochastic gradient descent yields gradients with nonzero variance even at optimality, this variance can serve as a cheap proxy for the scale of the Hessian diagonal.
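As a rough illustration of the idea, the sketch below applies the standard AdaGrad update (accumulate squared gradients, then divide each coordinate's step by the square root of its accumulator) to a noisy two-dimensional quadratic. The function names `adagrad_step` and `run_adagrad`, the toy objective, and the values chosen for `lr`, `eps`, `noise`, and `curvatures` are all assumptions made for this example, not part of the original text.

```python
import numpy as np

def adagrad_step(x, grad, state, lr=0.4, eps=1e-6):
    """One AdaGrad update (hypothetical helper for illustration).

    state holds the running sum of squared gradients per coordinate,
    which stands in for the scale of the Hessian diagonal.
    """
    state = state + grad ** 2                 # accumulate squared gradients
    x = x - lr * grad / np.sqrt(state + eps)  # per-coordinate scaled step
    return x, state

def run_adagrad(curvatures, steps=200, noise=0.1, seed=0):
    """Minimize the assumed toy objective f(x) = 0.5 * sum(c_i * x_i^2)
    using gradients corrupted by Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = np.ones_like(curvatures)
    state = np.zeros_like(curvatures)
    for _ in range(steps):
        # Exact gradient is c_i * x_i; the added noise mimics minibatch sampling.
        grad = curvatures * x + noise * rng.standard_normal(x.shape)
        x, state = adagrad_step(x, grad, state)
    return x, state

# Assumed curvatures: one flat direction and one steep direction.
curvatures = np.array([0.1, 10.0])
x_final, state = run_adagrad(curvatures)
print("final x:", x_final)
print("accumulated squared gradients:", state)
```

In this setup the accumulator for the steep coordinate should end up much larger than the one for the flat coordinate, so the steep direction receives a smaller effective learning rate. That is the role a diagonal preconditioner would play, obtained here without ever forming the Hessian.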

Updated 2026-05-15

Tags

D2L

Dive into Deep Learning @ D2L