Adadelta Update Rule

The Adadelta algorithm updates parameters using a sequence of operations based on leaky averages. Given a decay parameter $\rho$, the state variable for the gradient's second moment is updated as $\mathbf{s}_t = \rho \mathbf{s}_{t-1} + (1 - \rho) \mathbf{g}_t^2$. A rescaled gradient $\mathbf{g}_t'$ is then computed using the ratio of the root mean square of previous parameter changes to the root mean square of the gradients: $\mathbf{g}_t' = \frac{\sqrt{\Delta\mathbf{x}_{t-1} + \epsilon}}{\sqrt{\mathbf{s}_t + \epsilon}} \odot \mathbf{g}_t$. The model parameters are updated by subtracting this rescaled gradient: $\mathbf{x}_t = \mathbf{x}_{t-1} - \mathbf{g}_t'$. Finally, the state variable tracking the parameter changes, initialized at $\Delta\mathbf{x}_0 = 0$, is updated as $\Delta\mathbf{x}_t = \rho \Delta\mathbf{x}_{t-1} + (1 - \rho) {\mathbf{g}_t'}^2$. Here $\epsilon$ is a small constant (e.g., $10^{-5}$) added to maintain numerical stability.
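The four updates above can be sketched in NumPy as a single step function. This is a minimal illustration, not a reference implementation: the name `adadelta_step`, the default `rho=0.9`, and the toy quadratic objective are assumptions made for the example, and the squares and square roots are taken element-wise.

```python
import numpy as np

def adadelta_step(x, grad, s, delta, rho=0.9, eps=1e-5):
    """One Adadelta step (hypothetical helper; rho=0.9 is an assumed default).

    x     : current parameters x_{t-1}
    grad  : gradient g_t evaluated at x_{t-1}
    s     : leaky average of squared gradients, s_{t-1}
    delta : leaky average of squared rescaled gradients, Delta x_{t-1}
    """
    # s_t = rho * s_{t-1} + (1 - rho) * g_t^2
    s = rho * s + (1 - rho) * grad**2
    # g_t' = sqrt(Delta x_{t-1} + eps) / sqrt(s_t + eps) * g_t
    g_prime = np.sqrt(delta + eps) / np.sqrt(s + eps) * grad
    # x_t = x_{t-1} - g_t'
    x = x - g_prime
    # Delta x_t = rho * Delta x_{t-1} + (1 - rho) * g_t'^2
    delta = rho * delta + (1 - rho) * g_prime**2
    return x, s, delta

# Toy objective chosen only for illustration: f(x) = 0.1 * x1^2 + 2 * x2^2
def toy_grad(x):
    return np.array([0.2 * x[0], 4.0 * x[1]])

x = np.array([-5.0, -2.0])
s = np.zeros_like(x)      # s_0 = 0
delta = np.zeros_like(x)  # Delta x_0 = 0
for _ in range(50):
    x, s, delta = adadelta_step(x, toy_grad(x), s, delta)
```

Note that the sketch has no learning-rate argument: the step size comes entirely from the ratio of the two leaky averages, and with $\Delta\mathbf{x}_0 = 0$ the first steps are driven by $\epsilon$ alone.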

Updated 2026-05-16

Tags

D2L

Dive into Deep Learning @ D2L