Concept

Adam (Deep Learning Optimization Algorithm) Mathematical Implementation

M^{t} = \frac{\beta_{1} M^{t-1} + (1 - \beta_{1}) \nabla J(W^{t})}{1 - (\beta_{1})^{t}}

V^{t} = \frac{\beta_{2} V^{t-1} + (1 - \beta_{2}) \left(\nabla J(W^{t})\right)^{2}}{1 - (\beta_{2})^{t}}

W^{t} = W^{t-1} - \frac{\alpha}{\sqrt{V^{t} + \epsilon}} M^{t}


The factors $1 - (\beta_{1})^{t}$ and $1 - (\beta_{2})^{t}$ are used to normalize (bias-correct) both matrices: since M and V are initialized at zero, the authors of the algorithm noticed that they stay biased toward zero during the first steps.
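As a quick worked instance of this correction (numbers not from the note, just plugging in the usual $\beta_{1}=0.9$): at $t = 1$, without normalization the update gives $M^{1} = (1 - \beta_{1})\,\nabla J(W^{1}) = 0.1\,\nabla J(W^{1})$, whereas dividing by $1 - (\beta_{1})^{1} = 0.1$ restores $M^{1} = \nabla J(W^{1})$.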

$M^{t}$ - helper matrix, similar to the one used for momentum, but normalized.

$V^{t}$ - helper matrix, similar to the one used for RMSprop, but normalized.

$\beta_{1}, \beta_{2}$ - the same coefficients as in momentum and RMSprop (usually $\beta_{1}=0.9$, $\beta_{2}=0.999$).

$W^{t}$ - the parameters.

$\alpha$ - the starting learning rate (usually around 0.001).

$\epsilon$ - a small constant to avoid division by zero (usually around 1e-8). The same update rule applies to the bias parameters.
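A minimal NumPy sketch of one Adam step based on the equations above. The function name `adam_step`, its argument order, and the choice to keep the raw running averages and apply the $1 - \beta^{t}$ normalization to separate bias-corrected copies (as in the original paper) are assumptions, not part of the note; the gradient is assumed to be computed elsewhere, M and V start as zero arrays, and t starts at 1.

```python
import numpy as np

def adam_step(W, grad, M, V, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters W, given grad = dJ/dW at step t (t >= 1)."""
    # Running averages: momentum-style mean of the gradient and
    # RMSprop-style mean of the squared gradient.
    M = beta1 * M + (1 - beta1) * grad
    V = beta2 * V + (1 - beta2) * grad ** 2
    # Normalization by 1 - beta^t counteracts the zero initialization of M and V.
    M_hat = M / (1 - beta1 ** t)
    V_hat = V / (1 - beta2 ** t)
    # W^t = W^{t-1} - alpha / sqrt(V^t + eps) * M^t
    W = W - alpha / np.sqrt(V_hat + eps) * M_hat
    return W, M, V
```

Starting from `M = np.zeros_like(W)` and `V = np.zeros_like(W)`, the function would be called once per gradient step with an increasing t.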


Updated 2020-11-16

Tags

Data Science