As a widely adopted optimization algorithm, RMSProp ships as a built-in optimizer in the major deep learning frameworks, so practitioners can use it without coding the state variable updates by hand. In PyTorch, the optimizer is instantiated via `torch.optim.RMSprop`, with the decay parameter $$\gamma$$ passed as `alpha` and the learning rate as `lr`. In MXNet's Gluon API, the algorithm is selected with the string `'rmsprop'`, with the decay parameter passed as `gamma1` and the learning rate as `learning_rate`. In TensorFlow, the optimizer is created with `tf.keras.optimizers.RMSprop`, where the decay parameter is named `rho` and the learning rate `learning_rate`. Despite the differing parameter names, the underlying update is the same: each framework maintains the exponentially weighted average of squared gradients internally and applies the per-coordinate learning rate scaling automatically. Minor details, such as default hyperparameter values and how the stability constant $$\epsilon$$ enters the denominator, can vary between implementations. When trained on the Airfoil Self-Noise dataset with a learning rate of 0.01 and $$\gamma = 0.9$$, all three implementations converge to a training loss of approximately 0.245, matching the from-scratch implementation's performance.

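```python
# PyTorch
import torch
from d2l import torch as d2l

# Assumption: data comes from d2l's chapter-11 helper; batch size 10 is illustrative
data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
trainer = torch.optim.RMSprop
d2l.train_concise_ch11(trainer, {'lr': 0.01, 'alpha': 0.9}, data_iter)
```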
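For readers without the `d2l` helpers, the same optimizer can be driven by a plain PyTorch training loop. The following is a minimal sketch, not the setup that produced the loss figure above: the synthetic regression data, the model size, and the number of steps are all assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)

# Hypothetical synthetic data standing in for the Airfoil Self-Noise dataset
X = torch.randn(256, 5)
true_w = torch.tensor([[2.0], [-1.0], [0.5], [3.0], [-2.5]])
y = X @ true_w + 0.1 * torch.randn(256, 1)

model = torch.nn.Linear(5, 1)
loss_fn = torch.nn.MSELoss()
# `alpha` plays the role of the decay parameter gamma
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01, alpha=0.9)

initial_loss = loss_fn(model(X), y).item()
for step in range(500):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = loss_fn(model(X), y)    # full-batch loss, for simplicity
    loss.backward()                # compute gradients
    optimizer.step()               # RMSProp update; state is kept inside the optimizer
final_loss = loss_fn(model(X), y).item()

print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

The squared-gradient state never appears in user code here; the optimizer object holds it for each parameter.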

Concise RMSProp Implementation

When a linear regression model is trained from scratch on the Airfoil Self-Noise dataset using RMSProp with a learning rate of 0.01 and a decay parameter of $$\gamma = 0.9$$, the training loss converges to approximately 0.245. Setting $$\gamma = 0.9$$ means the state effectively averages over roughly the past $$\frac{1}{1-\gamma} = 10$$ squared gradients. This typical configuration pairs a modest learning rate with a high decay factor, in contrast to AdaGrad, which often needs a larger initial learning rate to counteract its aggressive learning rate decay.
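The $$\frac{1}{1-\gamma}$$ figure can be checked numerically. Unrolling the leaky average shows that the squared gradient from $$k$$ steps back receives weight $$(1-\gamma)\gamma^k$$. The sketch below (variable names are illustrative) computes the nominal window length and the share of total weight carried by the most recent window of observations.

```python
gamma = 0.9

# Nominal number of past observations the leaky average aggregates over
window = 1 / (1 - gamma)

# Weight given to the squared gradient from k steps back: (1 - gamma) * gamma**k
weights = [(1 - gamma) * gamma ** k for k in range(1000)]

# Share of the total weight carried by the most recent round(window) observations
recent_mass = sum(weights[:round(window)])

print(f"window = {window:.1f}, weight on most recent {round(window)} = {recent_mass:.3f}")
```

Older observations are never dropped entirely; their weights shrink geometrically, so "the past 10 observations" describes a time scale rather than a hard cutoff.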

A from-scratch implementation of the RMSProp optimizer for deep networks maintains an auxiliary state variable for each parameter tensor, initialized to zeros with the same shape as the parameter. During each update step, the state is updated as a leaky average of squared gradients: $$\mathbf{s} \leftarrow \gamma \mathbf{s} + (1 - \gamma) \mathbf{g}^2$$, where $$\gamma$$ is the decay factor, $$\mathbf{g}$$ is the current gradient, and the square is taken elementwise. The parameter is then decremented by the learning rate times the gradient divided elementwise by $$\sqrt{\mathbf{s} + \epsilon}$$, where $$\epsilon = 10^{-6}$$ is a numerical stability constant added inside the square root. Finally, the parameter gradients are zeroed out.

In PyTorch, this can be implemented as follows:

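```python
import torch

def init_rmsprop_states(feature_dim):
    # One zero-initialized state tensor per parameter, matching its shape
    s_w = torch.zeros((feature_dim, 1))
    s_b = torch.zeros(1)
    return (s_w, s_b)

def rmsprop(params, states, hyperparams):
    gamma, eps = hyperparams['gamma'], 1e-6
    for p, s in zip(params, states):
        with torch.no_grad():
            # Leaky average of squared gradients
            s[:] = gamma * s + (1 - gamma) * torch.square(p.grad)
            # Scale each coordinate's step by 1 / sqrt(s + eps)
            p[:] -= hyperparams['lr'] * p.grad / torch.sqrt(s + eps)
        # Reset gradients before the next minibatch
        p.grad.data.zero_()
```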
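To see the update rule in isolation, without a framework, the same two steps can be applied to a small two-dimensional objective. This is a sketch under stated assumptions: the objective $$f(x_1, x_2) = 0.1 x_1^2 + 2 x_2^2$$, the starting point, the learning rate of 0.4, and the helper names are all chosen for illustration and are not taken from the Airfoil experiment.

```python
import math

def rmsprop_step(x, s, grad, lr, gamma, eps=1e-6):
    """Apply one RMSProp update to a list of coordinates; returns new (x, s)."""
    new_x, new_s = [], []
    for xi, si, gi in zip(x, s, grad):
        si = gamma * si + (1 - gamma) * gi ** 2   # leaky average of g^2
        xi = xi - lr * gi / math.sqrt(si + eps)   # step scaled by 1 / sqrt(s + eps)
        new_x.append(xi)
        new_s.append(si)
    return new_x, new_s

def f(x):
    # Illustrative objective with very different curvature per coordinate
    return 0.1 * x[0] ** 2 + 2 * x[1] ** 2

def grad_f(x):
    return [0.2 * x[0], 4 * x[1]]

x, s = [-5.0, -2.0], [0.0, 0.0]   # state starts at zero, like the tensors above
start_value = f(x)
for _ in range(20):
    x, s = rmsprop_step(x, s, grad_f(x), lr=0.4, gamma=0.9)
end_value = f(x)

print(f"f went from {start_value:.3f} to {end_value:.6f}")
```

Because each coordinate is divided by its own running gradient magnitude, the steep $$x_2$$ direction and the shallow $$x_1$$ direction take steps of comparable size despite the 20-fold difference in gradient scale.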

