Concept

Adagrad

The Adagrad optimization algorithm addresses the difficulty of tuning learning rates for sparse features by replacing a simple counter of feature occurrences with a running sum of the squares of previously observed gradients. Specifically, it accumulates $s(i, t+1) = s(i, t) + \left(\partial_i f(\mathbf{x})\right)^2$ and uses this sum to set the learning rate of each coordinate $i$. Coordinates that frequently have large gradients have their step size scaled down significantly, while coordinates with small gradients receive gentler treatment, which eliminates the need to decide manually when a gradient counts as large.
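The accumulation rule can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes the commonly used form of the update, in which each coordinate's step is the base learning rate divided by the square root of its accumulated sum plus a small constant. The names `adagrad_step`, `grad_f`, `lr`, and `eps`, as well as the quadratic objective, are hypothetical choices made for this example.

```python
import math


def adagrad_step(x, grad, s, lr=0.1, eps=1e-6):
    """One Adagrad update on plain Python lists of equal length.

    x    -- current parameters
    grad -- gradient of the objective at x
    s    -- per-coordinate running sum of squared gradients
    lr   -- base learning rate (assumed name, not from the source)
    eps  -- small constant that avoids dividing by zero (assumed)
    """
    # Accumulate: s(i, t+1) = s(i, t) + (d_i f(x))^2
    new_s = [s_i + g_i ** 2 for s_i, g_i in zip(s, grad)]
    # Each coordinate's step is scaled by 1 / sqrt(accumulated sum + eps)
    new_x = [
        x_i - lr / math.sqrt(s_i + eps) * g_i
        for x_i, s_i, g_i in zip(x, new_s, grad)
    ]
    return new_x, new_s


def grad_f(x):
    # Gradient of the illustrative objective f(x) = 0.1 * x1^2 + 2 * x2^2,
    # chosen so that the second coordinate has much larger gradients.
    return [0.2 * x[0], 4.0 * x[1]]


x = [-5.0, -2.0]
s = [0.0, 0.0]
for _ in range(20):
    x, s = adagrad_step(x, grad_f(x), s, lr=0.4)
```

After the loop, the second entry of `s` is much larger than the first, so the coordinate with the steeper gradients ends up with the smaller effective learning rate `lr / sqrt(s_i + eps)`. No threshold for "large" gradients is set anywhere in the code.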

Updated 2026-05-15

Tags

Data Science

D2L

Dive into Deep Learning @ D2L