Adam Optimizer From-Scratch Implementation
A from-scratch implementation of the Adam optimizer requires initializing the state variables and performing the parameter updates iteratively over time steps $t$. For state initialization, two auxiliary variables are created for each parameter tensor: one to track the momentum ($v$) and one to track the second moment ($s$), both initialized to zeros. During each optimization step, the algorithm first updates these state variables using exponentially weighted moving averages of the gradients and their element-wise squares, with decay rates $\beta_1$ and $\beta_2$. It then applies bias correction to both state variables by dividing them by $1 - \beta_1^t$ and $1 - \beta_2^t$, respectively. Finally, the model parameters are updated by subtracting the learning rate times the bias-corrected momentum, divided by the sum of the square root of the bias-corrected second moment and a small constant ($\epsilon$) for numerical stability. In PyTorch, this is implemented as follows:
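import torch

def init_adam_states(feature_dim):
    # One (v, s) pair per parameter tensor: v is the momentum, s the second moment.
    v_w, v_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    s_w, s_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    return ((v_w, s_w), (v_b, s_b))

def adam(params, states, hyperparams):
    # hyperparams['t'] must start at 1; t = 0 makes the bias-correction denominators zero.
    beta1, beta2, eps = 0.9, 0.999, 1e-6
    for p, (v, s) in zip(params, states):
        with torch.no_grad():
            # Exponentially weighted moving averages of the gradient and its square.
            v[:] = beta1 * v + (1 - beta1) * p.grad
            s[:] = beta2 * s + (1 - beta2) * torch.square(p.grad)
            # Bias correction for the zero-initialized states.
            v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
            s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
            # In-place parameter update.
            p[:] -= hyperparams['lr'] * v_bias_corr / (torch.sqrt(s_bias_corr) + eps)
        # Reset the gradient so it does not accumulate into the next step.
        p.grad.data.zero_()
    # Advance the time step once per call, after all parameters are updated.
    hyperparams['t'] += 1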
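To see how the two functions fit into a training loop, the following is a minimal sketch that fits a linear regression model on synthetic data. The helper `fit_toy_regression`, the target values `true_w` and `true_b`, the data size, and the learning rate are all illustrative assumptions rather than part of the original material. The two optimizer functions are repeated so that the block runs on its own.

```python
import torch

def init_adam_states(feature_dim):
    # Same state layout as in the section: a (v, s) pair per parameter tensor.
    v_w, v_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    s_w, s_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    return ((v_w, s_w), (v_b, s_b))

def adam(params, states, hyperparams):
    beta1, beta2, eps = 0.9, 0.999, 1e-6
    for p, (v, s) in zip(params, states):
        with torch.no_grad():
            v[:] = beta1 * v + (1 - beta1) * p.grad
            s[:] = beta2 * s + (1 - beta2) * torch.square(p.grad)
            v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
            s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
            p[:] -= hyperparams['lr'] * v_bias_corr / (torch.sqrt(s_bias_corr) + eps)
        p.grad.data.zero_()
    hyperparams['t'] += 1

def fit_toy_regression(num_steps=200, lr=0.1, feature_dim=2, seed=0):
    """Hypothetical helper: fit y = Xw + b on noiseless synthetic data."""
    torch.manual_seed(seed)
    true_w = torch.tensor([[2.0], [-3.4]])  # assumed target weights
    true_b = 4.2                            # assumed target bias
    X = torch.randn(256, feature_dim)
    y = X @ true_w + true_b

    # Parameter shapes match the states created by init_adam_states.
    w = torch.zeros((feature_dim, 1), requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    states = init_adam_states(feature_dim)
    hyperparams = {'lr': lr, 't': 1}  # t starts at 1 for the bias correction

    losses = []
    for _ in range(num_steps):
        loss = ((X @ w + b - y) ** 2).mean()  # mean squared error
        loss.backward()                       # populates w.grad and b.grad
        adam([w, b], states, hyperparams)     # updates parameters, zeroes grads
        losses.append(loss.item())
    return w.detach(), b.detach(), losses

if __name__ == "__main__":
    w, b, losses = fit_toy_regression()
    print("first loss:", losses[0], "last loss:", losses[-1])
    print("w:", w.flatten().tolist(), "b:", b.item())
```

The order of `params` must match the order of `states`, since `adam` pairs them with `zip`. The full-batch loss is used here only to keep the sketch short; a minibatch loop would call `adam` the same way once per batch.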
Tags
D2L
Dive into Deep Learning @ D2L