Code

AdaGrad Optimizer From-Scratch Implementation

A from-scratch implementation of the AdaGrad optimization algorithm maintains one auxiliary state variable per parameter tensor, initialized to zeros with the same shape as the parameter. At each update step, the element-wise square of the gradient is added to this state. The parameter is then updated with a per-coordinate learning rate: it is decremented by the product of the initial learning rate and the gradient, divided element-wise by the square root of the sum of the accumulated state and a small constant (e.g., 1e-6) that guards against division by zero. Finally, the parameter gradients are zeroed out. Because the accumulated state never decreases, the effective learning rate of each coordinate can only shrink as training proceeds, so AdaGrad typically requires a larger initial learning rate than standard mini-batch stochastic gradient descent.
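The update described above can be written compactly as follows. The symbols are notation chosen here for illustration and do not appear in the original text: $\mathbf{g}_t$ is the gradient at step $t$, $\mathbf{s}_t$ the accumulated state, $\mathbf{w}_t$ the parameter, $\eta$ the initial learning rate, and $\epsilon$ the small stability constant. The square root and division are element-wise.

```latex
\begin{aligned}
\mathbf{s}_0 &= \mathbf{0}, \\
\mathbf{s}_t &= \mathbf{s}_{t-1} + \mathbf{g}_t \odot \mathbf{g}_t, \\
\mathbf{w}_t &= \mathbf{w}_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \odot \mathbf{g}_t.
\end{aligned}
```

Since every entry of $\mathbf{g}_t \odot \mathbf{g}_t$ is non-negative, each entry of $\mathbf{s}_t$ is non-decreasing in $t$, which is why the per-coordinate step size $\eta / \sqrt{\mathbf{s}_t + \epsilon}$ never grows.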

In PyTorch, this can be implemented as follows:

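import torch

def init_adagrad_states(feature_dim):
    # One zero-initialized accumulator per parameter, matching its shape.
    s_w = torch.zeros((feature_dim, 1))
    s_b = torch.zeros(1)
    return (s_w, s_b)

def adagrad(params, states, hyperparams):
    eps = 1e-6  # small constant for numerical stability
    for p, s in zip(params, states):
        with torch.no_grad():
            # Accumulate the element-wise squared gradient in place.
            s[:] += torch.square(p.grad)
            # Scale the step per coordinate by 1 / sqrt(s + eps).
            p[:] -= hyperparams['lr'] * p.grad / torch.sqrt(s + eps)
        # Reset gradients so the next backward pass starts from zero.
        p.grad.data.zero_()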
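To see the shrinking step size in isolation, the following is a minimal plain-Python sketch of the same update rule on a single scalar, with no PyTorch dependency. The function name `adagrad_scalar`, the toy objective f(x) = x^2, and the chosen values of `x0`, `lr`, and `steps` are all assumptions made for illustration; they are not part of the original implementation.

```python
import math


def adagrad_scalar(grad_fn, x0, lr, steps, eps=1e-6):
    """Run scalar AdaGrad and record the effective learning rate per step.

    Hypothetical helper for illustration: grad_fn maps x to df/dx.
    """
    x, s = x0, 0.0
    effective_lrs = []
    for _ in range(steps):
        g = grad_fn(x)
        s += g * g                     # accumulate the squared gradient
        eff = lr / math.sqrt(s + eps)  # per-step effective learning rate
        effective_lrs.append(eff)
        x -= eff * g                   # AdaGrad parameter update
    return x, effective_lrs


# Toy objective (assumed): f(x) = x^2, whose gradient is 2x.
x_final, effective_lrs = adagrad_scalar(lambda x: 2.0 * x, x0=5.0, lr=0.4, steps=50)

print("first effective lr:", effective_lrs[0])
print("last effective lr: ", effective_lrs[-1])
print("final x:", x_final)
```

Because `s` only ever grows, the recorded effective learning rates form a non-increasing sequence, which is the behavior that motivates choosing a larger initial learning rate than one would use with plain SGD.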

Updated 2026-05-15

Tags

D2L

Dive into Deep Learning @ D2L