Example

RMSProp Optimization Trajectory in 2D

To visualize RMSProp's convergence behavior, the algorithm is applied to the two-dimensional quadratic function $f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$ with a learning rate of $\eta = 0.4$ and decay parameter $\gamma = 0.9$. The coordinate-wise implementation computes the gradients $g_1 = 0.2 x_1$ and $g_2 = 4 x_2$, updates the leaky averages of squared gradients as $s_i \leftarrow \gamma s_i + (1 - \gamma) g_i^2$, and adjusts each coordinate by $x_i \leftarrow x_i - \frac{\eta}{\sqrt{s_i + \epsilon}} g_i$. After 20 epochs, the variables converge near the origin ($x_1 \approx -0.0106$, $x_2 \approx 0$). Unlike AdaGrad, which stalls in later iterations because its effective learning rate decreases too quickly, RMSProp keeps making progress throughout training because $\eta$ is controlled independently of the rescaling by the state variable.
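As a worked illustration of a single update, the first step can be written out by hand. The starting point $(x_1, x_2) = (-5, -2)$ and the initial state $s_1 = s_2 = 0$ are assumptions made for this sketch (the text above does not state them), and $\epsilon$ is neglected because it is tiny compared with $s_i$:

$$
\begin{aligned}
g_1 &= 0.2 \cdot (-5) = -1, & g_2 &= 4 \cdot (-2) = -8, \\
s_1 &= 0.9 \cdot 0 + 0.1 \cdot (-1)^2 = 0.1, & s_2 &= 0.9 \cdot 0 + 0.1 \cdot (-8)^2 = 6.4, \\
x_1 &\leftarrow -5 - \frac{0.4}{\sqrt{0.1}} \cdot (-1) \approx -3.735, & x_2 &\leftarrow -2 - \frac{0.4}{\sqrt{6.4}} \cdot (-8) \approx -0.735.
\end{aligned}
$$

Under these assumptions both coordinates move by the same amount, $\eta / \sqrt{1 - \gamma} \approx 1.265$, even though the gradients differ by a factor of 8. This is the per-coordinate rescaling at work: with zero initial state, dividing $g_i$ by $\sqrt{s_i}$ cancels the gradient magnitude on the first step.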

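```python
import math

eta, gamma = 0.4, 0.9  # learning rate and decay parameter

def rmsprop_2d(x1, x2, s1, s2):
    # Gradients of f(x) = 0.1 * x1**2 + 2 * x2**2
    g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6
    # Leaky averages of the squared gradients
    s1 = gamma * s1 + (1 - gamma) * g1 ** 2
    s2 = gamma * s2 + (1 - gamma) * g2 ** 2
    # Coordinate-wise rescaled updates
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2
```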
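The update rule can be traced over 20 epochs and set against AdaGrad with a small, self-contained driver. This is a minimal sketch rather than the source's own training loop. The helper names `run`, `rmsprop_step`, and `adagrad_step`, the starting point `(-5, -2)`, and the zero initial state are all assumptions introduced for illustration. The AdaGrad step uses the standard form that accumulates squared gradients without decay, with the same `ETA` for a like-for-like comparison.

```python
import math

ETA, GAMMA, EPS = 0.4, 0.9, 1e-6  # learning rate, decay, numerical-stability term

def rmsprop_step(x1, x2, s1, s2):
    """One RMSProp update on f(x) = 0.1 * x1**2 + 2 * x2**2."""
    g1, g2 = 0.2 * x1, 4 * x2
    s1 = GAMMA * s1 + (1 - GAMMA) * g1 ** 2  # leaky average
    s2 = GAMMA * s2 + (1 - GAMMA) * g2 ** 2
    x1 -= ETA / math.sqrt(s1 + EPS) * g1
    x2 -= ETA / math.sqrt(s2 + EPS) * g2
    return x1, x2, s1, s2

def adagrad_step(x1, x2, s1, s2):
    """One AdaGrad update: squared gradients accumulate without decay."""
    g1, g2 = 0.2 * x1, 4 * x2
    s1 += g1 ** 2
    s2 += g2 ** 2
    x1 -= ETA / math.sqrt(s1 + EPS) * g1
    x2 -= ETA / math.sqrt(s2 + EPS) * g2
    return x1, x2, s1, s2

def run(step, epochs=20, start=(-5.0, -2.0)):
    """Apply `step` repeatedly; the start point is an assumed value."""
    x1, x2 = start
    s1 = s2 = 0.0  # assumed zero initial state
    trajectory = [(x1, x2)]
    for _ in range(epochs):
        x1, x2, s1, s2 = step(x1, x2, s1, s2)
        trajectory.append((x1, x2))
    return trajectory

rmsprop_traj = run(rmsprop_step)
adagrad_traj = run(adagrad_step)
print("RMSProp after 20 epochs:", rmsprop_traj[-1])
print("AdaGrad after 20 epochs:", adagrad_traj[-1])
```

Under these assumptions, the RMSProp trajectory ends close to the origin, while AdaGrad's `x1` is still far from it after the same number of steps, because AdaGrad's accumulator only grows and keeps shrinking the effective step size.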

Updated 2026-05-15

Tags

D2L

Dive into Deep Learning @ D2L