
Adam Optimizer From-Scratch Implementation

A from-scratch implementation of the Adam optimizer requires initializing the state variables and then updating the parameters iteratively over time steps $t$. For state initialization, two auxiliary variables are created for each parameter tensor: one to track the momentum ($\mathbf{v}$, the first moment) and one to track the second moment ($\mathbf{s}$), both initialized to zeros. During each optimization step, the algorithm first updates these state variables as exponentially weighted moving averages of the gradients and of their element-wise squares, respectively. It then applies bias correction to both state variables by dividing them by $1 - \beta_1^t$ and $1 - \beta_2^t$, respectively. Finally, each parameter is updated by subtracting the learning rate times the bias-corrected momentum, divided element-wise by the sum of the square root of the bias-corrected second moment and a small constant ($\epsilon$) added for numerical stability. In PyTorch, this is implemented as follows:

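import torch

def init_adam_states(feature_dim):
    # One (v, s) pair per parameter tensor: weights first, then bias.
    v_w, v_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    s_w, s_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    return ((v_w, s_w), (v_b, s_b))

def adam(params, states, hyperparams):
    # hyperparams['t'] must start at 1; at t = 0 the bias correction divides by zero.
    beta1, beta2, eps = 0.9, 0.999, 1e-6
    for p, (v, s) in zip(params, states):
        with torch.no_grad():
            # Moving averages of the gradient and of its element-wise square.
            v[:] = beta1 * v + (1 - beta1) * p.grad
            s[:] = beta2 * s + (1 - beta2) * torch.square(p.grad)
            # Bias correction compensates for the zero initialization of v and s.
            v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
            s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
            p[:] -= hyperparams['lr'] * v_bias_corr / (torch.sqrt(s_bias_corr) + eps)
        p.grad.data.zero_()
    # Advance the time step once per optimization step, not once per parameter.
    hyperparams['t'] += 1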
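To see the effect of bias correction in isolation, the same update rule can be traced on a single scalar parameter. The sketch below is an illustration only and is not part of the implementation above: it uses plain Python instead of PyTorch, and the helper names `adam_step_scalar` and `minimize_quadratic`, the objective $f(x) = (x - 3)^2$, and the learning rate of 0.1 are all assumptions chosen for the example.

```python
import math

def adam_step_scalar(x, grad, v, s, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-6):
    """One Adam update for a scalar parameter (hypothetical helper)."""
    # Exponentially weighted moving averages of the gradient and its square.
    v = beta1 * v + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad ** 2
    # Bias correction; t is assumed to start at 1.
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    # Parameter update, with eps added outside the square root as above.
    x = x - lr * v_hat / (math.sqrt(s_hat) + eps)
    return x, v, s

def minimize_quadratic(x0=0.0, target=3.0, steps=500, lr=0.1):
    """Minimize (x - target)^2 and return the trajectory of x (illustrative)."""
    x, v, s = x0, 0.0, 0.0
    history = [x]
    for t in range(1, steps + 1):
        grad = 2.0 * (x - target)  # derivative of (x - target)^2
        x, v, s = adam_step_scalar(x, grad, v, s, t, lr=lr)
        history.append(x)
    return history

history = minimize_quadratic()
first_step = abs(history[1] - history[0])
print(f"first step size: {first_step:.6f}")
print(f"final x: {history[-1]:.4f}")
```

At $t = 1$ the corrected moments reduce to the gradient and its square, so the first step has magnitude $\text{lr} \cdot |g| / (|g| + \epsilon)$, which is roughly the learning rate regardless of how large the gradient is. Without the correction, the zero-initialized averages would give a first step several times larger with these values of $\beta_1$ and $\beta_2$.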


Updated 2026-05-16


Tags

D2L

Dive into Deep Learning @ D2L