Aggressive Learning Rate Decay in AdaGrad

A key characteristic of AdaGrad is that it accumulates squared gradients in a state variable $\mathbf{s}_t$. Because this sum grows continuously at an approximately linear rate, the effective per-coordinate learning rate decays at a rate of $\mathcal{O}(t^{-\frac{1}{2}})$. While this rapid decay ensures convergence and is perfectly adequate for convex optimization problems, it is often too aggressive for deep learning applications. In training deep neural networks, this continuous decay can cause the learning rate to diminish prematurely, resulting in the independent variables moving very little during the later stages of iteration and potentially halting learning before an optimal solution is reached.
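The decay can be made concrete with a minimal sketch for a single coordinate. It assumes the effective rate takes the form eta / sqrt(s_t + eps) and that the gradient has constant magnitude 1, so that s_t = t exactly. Both are illustrative assumptions, and the names `adagrad_effective_lr`, `eta`, and `eps` are hypothetical rather than taken from any library.

```python
import math


def adagrad_effective_lr(grads, eta=0.1, eps=1e-6):
    """Return the effective learning rate eta / sqrt(s_t + eps) after each step.

    `grads` is a sequence of scalar gradients for one coordinate.
    `eta` and `eps` are illustrative defaults, not recommended values.
    """
    s = 0.0  # running sum of squared gradients (the state variable s_t)
    rates = []
    for g in grads:
        s += g * g  # s_t = s_{t-1} + g_t^2
        rates.append(eta / math.sqrt(s + eps))
    return rates


# Assumed setting: gradient magnitude stays at 1, so s_t grows linearly in t.
rates = adagrad_effective_lr([1.0] * 10_000)

# Compare the effective rate with the eta / sqrt(t) reference curve.
for t in (1, 100, 10_000):
    print(t, rates[t - 1], 0.1 / math.sqrt(t))
```

Under these assumptions the effective rate falls by a factor of 10 for every 100-fold increase in the step count, and it never recovers, even if the gradients later become small. This is the behavior that tends to stall training in the later stages of deep learning runs.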

Updated 2026-05-15

Tags

Deep Learning (in Machine learning)

D2L

Dive into Deep Learning @ D2L

Data Science

Computing Sciences