1Cademy - Aggressive Learning Rate Decay in AdaGrad

Aggressive Learning Rate Decay in AdaGrad

A key characteristic of AdaGrad is that it accumulates squared gradients in a state variable $\mathbf{s}_t$. When gradient magnitudes stay roughly constant, each coordinate of this sum grows approximately linearly in $t$, and because the step size is divided by $\sqrt{\mathbf{s}_t}$, the effective per-coordinate learning rate decays as $\mathcal{O}(t^{-\frac{1}{2}})$. This decay supports convergence and is adequate for convex optimization problems, but it is often too aggressive for deep learning. When training deep neural networks, the learning rate can diminish prematurely: since $\mathbf{s}_t$ never decreases, the parameters (the independent variables) move very little in the later iterations, and learning can stall before an optimal solution is reached.
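The decay rate can be checked numerically with a minimal sketch. It assumes a single coordinate whose gradient magnitude is constant at 1.0, which is an idealization and not something real training guarantees. The function name `adagrad_effective_lr`, the base rate `eta`, and the stabilizer `eps` are illustrative choices, and the form $\eta / \sqrt{s_t + \epsilon}$ is one common way to write the effective rate.

```python
import math

def adagrad_effective_lr(grads, eta=0.1, eps=1e-6):
    """Return the effective learning rate eta / sqrt(s_t + eps) after each step.

    s_t is the running sum of squared gradients for one coordinate.
    eta and eps are illustrative defaults, not values from the source.
    """
    s = 0.0
    rates = []
    for g in grads:
        s += g * g                       # accumulate the squared gradient
        rates.append(eta / math.sqrt(s + eps))
    return rates

# Assumption: constant gradient magnitude, so s_t grows linearly in t.
rates = adagrad_effective_lr([1.0] * 10000)

# Each 100x increase in t shrinks the effective rate by about 10x,
# which matches the t^(-1/2) decay.
for t in (1, 100, 10000):
    print(t, rates[t - 1])
```

Under this assumption the printed rates are roughly 0.1, 0.01, and 0.001. Because the accumulated sum never shrinks, the effective rate cannot recover later, even if the gradients become small.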