Learn Before
Concept
Advantages and Limitations of AdaGrad
AdaGrad offers several practical advantages for optimization:
- It dynamically adjusts the learning rate on a per-coordinate basis, so practitioners do not need to manually tune individual learning rates.
- By using gradient magnitude as a scaling signal, coordinates with historically large gradients receive proportionally smaller step sizes, enabling balanced progress across dimensions.
- It serves as a computationally inexpensive alternative to exact second-derivative methods, which are typically infeasible in deep learning due to memory and computational constraints; instead, the gradient acts as a useful proxy for curvature information.
- It is particularly well-suited for problems with uneven structure or sparse features, where the learning rate should decrease more slowly for infrequently occurring terms.
However, AdaGrad has a notable limitation: because it monotonically accumulates squared gradients, the effective learning rate can shrink excessively over time. On deep learning problems, this aggressive decay can cause learning to stall prematurely, which has motivated successor algorithms such as Adam that mitigate this issue.
0
1
Updated 2026-05-15
Contributors are:
Who are from:
Tags
Deep Learning (in Machine learning)
D2L
Dive into Deep Learning @ D2L
Data Science
Computing Sciences