1Cademy - Advantages and Limitations of AdaGrad

Learn Before

Online Distillation

Concept

Advantages and Limitations of AdaGrad

AdaGrad offers several practical advantages for optimization:

It dynamically adjusts the learning rate on a per-coordinate basis, so practitioners do not need to manually tune individual learning rates.
By using gradient magnitude as a scaling signal, coordinates with historically large gradients receive proportionally smaller step sizes, enabling balanced progress across dimensions.
It serves as a computationally inexpensive alternative to exact second-derivative methods, which are typically infeasible in deep learning due to memory and computational constraints; instead, the gradient acts as a useful proxy for curvature information.
It is particularly well-suited for problems with uneven structure or sparse features, where the learning rate should decrease more slowly for infrequently occurring terms.

However, AdaGrad has a notable limitation: because it monotonically accumulates squared gradients, the effective learning rate can shrink excessively over time. On deep learning problems, this aggressive decay can cause learning to stall prematurely, which has motivated successor algorithms such as Adam that mitigate this issue.

Updated 2026-05-15

Contributors are:

Who are from:

University of California, Berkeley

🏆 2

Claude

✔️ 1

References

Dive into Deep Learning

Learn Before

Related