1Cademy - Adam Optimizer From-Scratch Implementation

Learn Before

Adam Optimizer Update Rule

Code

Adam Optimizer From-Scratch Implementation

A from-scratch implementation of the Adam optimizer has two parts: initializing the state variables and updating the parameters iteratively over time steps $t$. For state initialization, two auxiliary variables are created for each parameter tensor, one tracking the momentum ($\mathbf{v}$, the first moment of the gradient) and one tracking the second moment ($\mathbf{s}$), both initialized to zeros with the same shape as the parameter. During each optimization step, the algorithm first updates these state variables as exponentially weighted moving averages of the gradients and of their element-wise squares, with decay rates $\beta_1$ and $\beta_2$. Because both averages start at zero, they are biased toward zero in early steps, so the algorithm then applies bias correction by dividing them by $1 - \beta_1^t$ and $1 - \beta_2^t$, respectively; the step counter $t$ must start at 1, since $t = 0$ would make both divisors zero. Finally, each parameter is updated by subtracting the learning rate times the bias-corrected momentum divided element-wise by the square root of the bias-corrected second moment, with a small constant $\epsilon$ added to the denominator for numerical stability. In PyTorch, this is implemented as follows:

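import torch

def init_adam_states(feature_dim):
    # One (v, s) pair per parameter: weights of shape (feature_dim, 1), scalar bias.
    v_w, v_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    s_w, s_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    return ((v_w, s_w), (v_b, s_b))

def adam(params, states, hyperparams):
    # hyperparams must hold 'lr' and the step counter 't', which starts at 1.
    beta1, beta2, eps = 0.9, 0.999, 1e-6
    for p, (v, s) in zip(params, states):
        with torch.no_grad():
            # Moving averages of the gradient and of its element-wise square.
            v[:] = beta1 * v + (1 - beta1) * p.grad
            s[:] = beta2 * s + (1 - beta2) * torch.square(p.grad)
            # Bias correction for the zero-initialized averages.
            v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
            s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
            # In-place parameter update; eps is added outside the square root.
            p[:] -= hyperparams['lr'] * v_bias_corr / (torch.sqrt(s_bias_corr) + eps)
        # Clear the gradient so the next backward pass does not accumulate into it.
        p.grad.data.zero_()
    # Advance the step counter once per call, after all parameters are updated.
    hyperparams['t'] += 1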
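
The two functions above can be exercised on a small problem to see how the state and the step counter are passed around. The following is a minimal sketch, not part of the original listing: the synthetic linear-regression data, the names `true_w`, `true_b`, `mse_loss`, and `num_steps`, and the learning rate of 0.1 are all assumptions chosen for illustration. The two functions are repeated so the block runs on its own.

```python
import torch

def init_adam_states(feature_dim):
    v_w, v_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    s_w, s_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    return ((v_w, s_w), (v_b, s_b))

def adam(params, states, hyperparams):
    beta1, beta2, eps = 0.9, 0.999, 1e-6
    for p, (v, s) in zip(params, states):
        with torch.no_grad():
            v[:] = beta1 * v + (1 - beta1) * p.grad
            s[:] = beta2 * s + (1 - beta2) * torch.square(p.grad)
            v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
            s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
            p[:] -= hyperparams['lr'] * v_bias_corr / (torch.sqrt(s_bias_corr) + eps)
        p.grad.data.zero_()
    hyperparams['t'] += 1

# Hypothetical synthetic data: y = X @ true_w + true_b + small noise.
torch.manual_seed(0)
feature_dim = 3
true_w = torch.tensor([[2.0], [-1.0], [0.5]])
true_b = 4.0
X = torch.randn(200, feature_dim)
y = X @ true_w + true_b + 0.01 * torch.randn(200, 1)

# Parameters in the same order as the states returned by init_adam_states.
w = torch.zeros((feature_dim, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)
states = init_adam_states(feature_dim)

# 't' starts at 1 so the bias-correction divisors are nonzero.
hyperparams = {'lr': 0.1, 't': 1}

def mse_loss():
    # Mean squared error of the linear model on the full data set.
    return ((X @ w + b - y) ** 2).mean()

initial_loss = mse_loss().item()
num_steps = 300
for _ in range(num_steps):
    loss = mse_loss()
    loss.backward()                    # populates w.grad and b.grad
    adam([w, b], states, hyperparams)  # updates w, b, the states, and 't'
final_loss = mse_loss().item()

print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}, t = {hyperparams['t']}")
```

Note that `adam` mutates `states` and `hyperparams` in place, so the same objects must be passed on every call; creating a fresh state tuple or resetting `'t'` mid-training would discard the accumulated moment estimates.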

Updated 2026-05-16

References

Dive into Deep Learning

Learn After

Concise Adam Implementation
