1Cademy - AdaGrad Optimizer From-Scratch Implementation

A from-scratch implementation of the AdaGrad optimization algorithm maintains one auxiliary state variable per parameter tensor, initialized to zeros with the same shape as the parameter. At each update step, the element-wise square of the gradient is added to this state. Each parameter element is then updated with its own scaled learning rate: the element is decremented by the base learning rate times the gradient, divided by the square root of the sum of the accumulated state and a small constant (e.g., 1e-6) that provides numerical stability. Finally, the parameter gradients are reset to zero. Because the state only grows, the effective learning rate can only shrink over training, so AdaGrad typically calls for a larger base learning rate than standard mini-batch stochastic gradient descent.
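The update described above can be written compactly as follows. The symbols are notational assumptions not taken from the text: w for a parameter tensor, g for its gradient, s for the accumulated state, η for the base learning rate, ε for the stability constant, and ℓ for the loss. Squaring, the square root, and division are all element-wise.

```latex
\begin{aligned}
\mathbf{g}_t &= \nabla_{\mathbf{w}}\, \ell(\mathbf{w}_{t-1}), \\
\mathbf{s}_t &= \mathbf{s}_{t-1} + \mathbf{g}_t^{2}, \qquad \mathbf{s}_0 = \mathbf{0}, \\
\mathbf{w}_t &= \mathbf{w}_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \odot \mathbf{g}_t .
\end{aligned}
```

Under this form, the first step has magnitude close to η in every coordinate whose gradient is much larger than the square root of ε, since the state then equals the squared gradient itself.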

In PyTorch, this can be implemented as follows:

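import torch

def init_adagrad_states(feature_dim):
    # One accumulator per parameter, zero-initialized with the parameter's shape.
    s_w = torch.zeros((feature_dim, 1))
    s_b = torch.zeros(1)
    return (s_w, s_b)

def adagrad(params, states, hyperparams):
    eps = 1e-6  # small constant for numerical stability
    for p, s in zip(params, states):
        with torch.no_grad():
            # Accumulate the element-wise squared gradient.
            s[:] += torch.square(p.grad)
            # Scale each element's step by 1 / sqrt(s + eps).
            p[:] -= hyperparams['lr'] * p.grad / torch.sqrt(s + eps)
        # Reset the gradient before the next backward pass.
        p.grad.zero_()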
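
As a rough usage sketch, the two functions can be exercised on a small synthetic linear-regression problem. Everything outside `init_adagrad_states` and `adagrad` is an assumption made for illustration: the data, the target weights, the learning rate of 0.5, the step count, and the hypothetical helpers `mse` and `effective_lr`. The functions are repeated so that the block runs on its own.

```python
import torch

def init_adagrad_states(feature_dim):
    # One zero-initialized accumulator per parameter, matching its shape.
    return (torch.zeros((feature_dim, 1)), torch.zeros(1))

def adagrad(params, states, hyperparams):
    eps = 1e-6
    for p, s in zip(params, states):
        with torch.no_grad():
            s[:] += torch.square(p.grad)
            p[:] -= hyperparams['lr'] * p.grad / torch.sqrt(s + eps)
        p.grad.zero_()

# Synthetic data (illustrative values, not from the source).
torch.manual_seed(0)
feature_dim, n = 2, 256
true_w = torch.tensor([[2.0], [-3.4]])
true_b = 4.2
X = torch.randn(n, feature_dim)
y = X @ true_w + true_b + 0.01 * torch.randn(n, 1)

# Parameters in the shapes that init_adagrad_states expects.
w = torch.zeros((feature_dim, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)
states = init_adagrad_states(feature_dim)
hyperparams = {'lr': 0.5}  # assumed value, chosen only for this demo

def mse(w, b):
    # Hypothetical helper: mean squared error of the linear model.
    return ((X @ w + b - y) ** 2).mean()

def effective_lr(s):
    # Hypothetical helper: per-element step scale lr / sqrt(s + eps)
    # for a single-element state tensor such as the bias accumulator.
    return (hyperparams['lr'] / torch.sqrt(s + 1e-6)).item()

initial_loss = mse(w, b).item()
eff_lr_first = None
for step in range(200):
    loss = mse(w, b)
    loss.backward()
    adagrad([w, b], states, hyperparams)
    if step == 0:
        eff_lr_first = effective_lr(states[1])
eff_lr_last = effective_lr(states[1])
final_loss = mse(w, b).item()

print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}")
print(f"effective lr for bias: {eff_lr_first:.4f} -> {eff_lr_last:.4f}")
```

The second printed line illustrates why a larger base learning rate is usually chosen: the accumulator never decreases, so the effective step scale for each element can only fall as training proceeds.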

Updated 2026-05-15

References

Dive into Deep Learning