Concept

Computational Cost of AdaGrad

Although the AdaGrad algorithm requires maintaining an auxiliary state variable st\mathbf{s}_t to allow for an individual learning rate per coordinate, this additional operation does not significantly increase its computational cost relative to standard stochastic gradient descent (SGD). The storage and element-wise arithmetic required to update st\mathbf{s}_t are relatively inexpensive, simply because the primary computational expense in optimizing deep learning models remains the forward pass to evaluate the objective function l(yt,f(xt,w))l(y_t, f(\mathbf{x}_t, \mathbf{w})) and the backward pass to compute its derivative.

0

1

Updated 2026-05-15

Contributors are:

Who are from:

Tags

D2L

Dive into Deep Learning @ D2L