Adagrad addresses the difficulty of tuning learning rates for sparse features by replacing the simple per-feature occurrence counter with a running sum of the squares of previously observed gradients. Specifically, it accumulates $$s(i, t+1) = s(i, t) + \left(\partial_i f(\mathbf{x})\right)^2$$ and uses $$s(i, t)$$ in place of the count when scaling the learning rate for coordinate $$i$$. As a result, the step size shrinks quickly for coordinates whose gradients are frequently large and more slowly for coordinates whose gradients are small, so there is no need to decide by hand when a gradient counts as large enough.
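The accumulator above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `adagrad_step`, the toy objective $$f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$$, and the values chosen for `eta0` and `c` are assumptions made for the example.

```python
import math

def adagrad_step(x, grad, s, eta0=0.4, c=1e-6):
    """Apply one Adagrad-style update; returns the new (x, s).

    x, grad, and s are equal-length lists of floats. The names and
    default values here are illustrative assumptions.
    """
    # s(i, t+1) = s(i, t) + (d_i f(x))^2, accumulated per coordinate
    new_s = [s_i + g_i ** 2 for s_i, g_i in zip(s, grad)]
    # Per-coordinate rate eta0 / sqrt(s + c); c keeps the division finite
    new_x = [x_i - eta0 / math.sqrt(s_i + c) * g_i
             for x_i, s_i, g_i in zip(x, new_s, grad)]
    return new_x, new_s

def grad_f(x):
    # Gradient of the toy objective f(x) = 0.1 * x1^2 + 2 * x2^2
    return [0.2 * x[0], 4.0 * x[1]]

x, s = [-5.0, -2.0], [0.0, 0.0]
for _ in range(20):
    x, s = adagrad_step(x, grad_f(x), s)
```

In this toy problem the second coordinate has much larger gradients, so its accumulator grows faster and its effective learning rate is reduced more aggressively than that of the first coordinate.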

Adagrad

To address the learning rate dilemma for sparse features, one approach is to adjust the learning rate according to how often each feature occurs. Instead of a global time-based decay $$\eta = \frac{\eta_0}{\sqrt{t + c}}$$, a feature-specific rate $$\eta_i = \frac{\eta_0}{\sqrt{s(i, t) + c}}$$ can be used, where $$s(i, t)$$ counts how many times feature $$i$$ has been nonzero up to time $$t$$. However, this method breaks down for data that is not strictly sparse but instead has gradients that are mostly very small and only rarely large, because there is no clear threshold at which a feature should count as observed.
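The count-based rate can be illustrated with a short Python sketch. The helper names `update_counts` and `feature_rates`, the small data stream, and the values of `eta0` and `c` are all hypothetical, chosen only to show how a rarely seen feature keeps a larger learning rate.

```python
import math

def update_counts(counts, features):
    """Increment s(i, t) for every feature that is nonzero in this example."""
    return [n + (1 if v != 0 else 0) for n, v in zip(counts, features)]

def feature_rates(counts, eta0=0.1, c=1.0):
    """Per-feature rate eta_i = eta0 / sqrt(s(i, t) + c)."""
    return [eta0 / math.sqrt(n + c) for n in counts]

# Feature 0 is nonzero in every example; feature 1 appears only once.
stream = [[1.0, 0.0], [2.0, 0.0], [0.5, 3.0], [1.5, 0.0]]

counts = [0, 0]
for features in stream:
    counts = update_counts(counts, features)

rates = feature_rates(counts)
```

The test `v != 0` is exactly where the approach becomes fragile: a feature whose value is tiny but nonzero is counted the same as one with a large value.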

Claude

When training models on sparse features, a standard decreasing learning rate, such as one that decays as $$\mathcal{O}(t^{-\frac{1}{2}})$$, creates an optimization dilemma. If the learning rate decreases too quickly, the parameters for infrequent features will not be updated enough to reach their optimal values by the time those features finally appear. Conversely, if the learning rate decreases slowly enough to accommodate these infrequent features, the parameters for common features will converge more slowly than they otherwise could.
