In deep learning, applying diagonal preconditioning directly is typically impractical, because computing the true second derivative (the Hessian matrix) for a parameter vector $$\mathbf{x} \in \mathbb{R}^d$$ requires $$\mathcal{O}(d^2)$$ space and computation. AdaGrad sidesteps this bottleneck by using the magnitude of the stochastic gradients themselves, accumulated per coordinate as a running sum of squared gradients. Because stochastic gradients have nonzero variance even at optimality, this accumulated magnitude serves as a computationally inexpensive proxy for the scale of the Hessian diagonal.
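As a rough illustration of the idea, the following minimal sketch applies one AdaGrad update to a badly scaled quadratic. The helper names `adagrad_step` and `grad_f`, the learning rate, the `eps` term, and the test function are all assumptions chosen for this example, and the gradient is deterministic rather than stochastic to keep the arithmetic easy to follow.

```python
import math

def adagrad_step(x, grad, state, lr=0.1, eps=1e-8):
    """Apply one AdaGrad update in place.

    `state` holds the running sum of squared gradients for each coordinate
    (an illustrative layout, not a library API).
    """
    for i in range(len(x)):
        state[i] += grad[i] ** 2                              # accumulate squared gradient
        x[i] -= lr * grad[i] / (math.sqrt(state[i]) + eps)    # per-coordinate rescaled step
    return x, state

def grad_f(x):
    # Gradient of the assumed test function f(x) = 0.5 * (100 * x[0]**2 + x[1]**2)
    return [100.0 * x[0], 1.0 * x[1]]

x = [1.0, 1.0]
state = [0.0, 0.0]
x, state = adagrad_step(x, grad_f(x), state)
print(x, state)
```

Although the two gradient components differ by a factor of 100, dividing by the square root of the accumulated squared gradient makes the first step nearly the same size in both coordinates (about `lr`), which is the rescaling effect a diagonal preconditioner aims for.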

Claude

Because exact preconditioning via a full eigendecomposition is computationally prohibitive, a much cheaper alternative is to approximate it by rescaling the problem using only the diagonal entries of the matrix $$\mathbf{Q}$$. This diagonal preconditioning forms the rescaled matrix $$\tilde{\mathbf{Q}} = \textrm{diag}^{-\frac{1}{2}}(\mathbf{Q}) \mathbf{Q} \textrm{diag}^{-\frac{1}{2}}(\mathbf{Q})$$. Its entries are $$\tilde{\mathbf{Q}}_{ij} = \mathbf{Q}_{ij} / \sqrt{\mathbf{Q}_{ii} \mathbf{Q}_{jj}}$$, so every diagonal element satisfies $$\tilde{\mathbf{Q}}_{ii} = 1$$. In many cases, particularly when the problem is roughly axis-aligned, this simple rescaling considerably reduces the condition number without the cost of computing the eigenvalues.
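The rescaling can be checked on a small example. The sketch below is only illustrative: the matrix `Q` and the helpers `diag_precondition` and `cond_2x2_symmetric` are assumptions made for this example, and the condition number uses the closed-form eigenvalues of a symmetric positive definite 2×2 matrix rather than a general eigensolver.

```python
import math

def diag_precondition(Q):
    """Return diag(Q)^{-1/2} Q diag(Q)^{-1/2} for a square matrix with positive diagonal."""
    n = len(Q)
    scale = [1.0 / math.sqrt(Q[i][i]) for i in range(n)]
    return [[scale[i] * Q[i][j] * scale[j] for j in range(n)] for i in range(n)]

def cond_2x2_symmetric(M):
    """Condition number of a symmetric positive definite 2x2 matrix."""
    a, b, c = M[0][0], M[0][1], M[1][1]
    mean = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    # Eigenvalues are mean + radius and mean - radius
    return (mean + radius) / (mean - radius)

# Assumed example: nearly axis-aligned, with very different scales per coordinate
Q = [[100.0, 2.0], [2.0, 1.0]]
Q_tilde = diag_precondition(Q)

print(Q_tilde)
print(cond_2x2_symmetric(Q), cond_2x2_symmetric(Q_tilde))
```

For this particular matrix the diagonal of the rescaled matrix is 1 and the off-diagonal entry becomes 0.2, so the condition number falls from roughly 104 to 1.5. The gain is large here because most of the ill-conditioning comes from the mismatch in diagonal scales; a matrix whose distortion is not axis-aligned would benefit less.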
