To address the learning rate dilemma for sparse features, one approach is to adjust the learning rate based on how often each feature occurs. Instead of a global time-based decay $$\eta = \frac{\eta_0}{\sqrt{t + c}}$$, a feature-specific rate $$\eta_i = \frac{\eta_0}{\sqrt{s(i, t) + c}}$$ can be used, where $$s(i, t)$$ counts the number of times feature $$i$$ has been nonzero up to time $$t$$. However, this method breaks down for data that is not strictly sparse but instead has gradients that are usually very small and only rarely large, because there is no clear threshold above which a feature counts as observed.
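The count-based rate above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `feature_learning_rates` and `sgd_step_with_counts`, the default values of `eta0` and `c`, and the choice to increment the count before computing the rate are all hypothetical.

```python
import math

def feature_learning_rates(counts, eta0=0.1, c=1.0):
    """Return eta_i = eta0 / sqrt(s(i, t) + c) for each count s(i, t)."""
    return [eta0 / math.sqrt(s + c) for s in counts]

def sgd_step_with_counts(w, grad, counts, eta0=0.1, c=1.0):
    """One SGD step with a per-feature, count-based learning rate.

    w, grad and counts are lists of equal length; w and counts are
    updated in place and also returned.
    """
    for i, g in enumerate(grad):
        # Assumption: a feature counts as "observed" only when its gradient
        # is exactly nonzero. This is the threshold that becomes hard to
        # define when gradients are small but not exactly zero.
        if g != 0.0:
            counts[i] += 1
            eta_i = eta0 / math.sqrt(counts[i] + c)
            w[i] -= eta_i * g
    return w, counts

# Feature 0 is active at every step, feature 1 only at the last step.
w, counts = [0.0, 0.0], [0, 0]
for t in range(100):
    grad = [1.0, 1.0 if t == 99 else 0.0]
    sgd_step_with_counts(w, grad, counts)

print(counts)                          # per-feature observation counts
print(feature_learning_rates(counts))  # the rare feature keeps a larger rate
```

In this sketch the rare feature still takes a large step when it finally appears, while the common feature's rate has already decayed with its own count.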

Feature Count-Based Learning Rate Adjustment

When training models on sparse features, a standard decreasing learning rate, such as $$\mathcal{O}(t^{-\frac{1}{2}})$$, creates an optimization dilemma. If the learning rate decreases too quickly, it is already very small by the time an infrequent feature appears, so the parameters for that feature are not updated enough to reach their optimal values. Conversely, if the learning rate decreases slowly enough to accommodate infrequent features, the parameters for common features converge more slowly than they otherwise would.
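A small numeric sketch makes the dilemma concrete. It assumes one particular $$\mathcal{O}(t^{-\frac{1}{2}})$$ schedule, $$\eta = \frac{\eta_0}{\sqrt{t + c}}$$; the function name `global_rate` and the values of `eta0`, `c`, and the step indices are illustrative choices.

```python
import math

def global_rate(t, eta0=0.1, c=1.0):
    """Global time-based decay eta = eta0 / sqrt(t + c)."""
    return eta0 / math.sqrt(t + c)

# A common feature gets its first update at step 0; a rare feature is first
# observed only at step 9999 (hypothetical step indices).
rate_common_first = global_rate(0)
rate_rare_first = global_rate(9_999)

# The rare feature's very first update uses a rate roughly 100x smaller.
ratio = rate_common_first / rate_rare_first
print(rate_common_first, rate_rare_first, ratio)
```

Under these assumptions the rare feature starts learning with a step size about two orders of magnitude smaller than the one the common feature started with, even though its parameter has not moved at all.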

Claude

Sparse features are common in machine learning domains such as natural language processing, computational advertising, and personalized collaborative filtering. They occur only infrequently in the data, so the parameters associated with them receive meaningful updates only when the features are actually observed during training.

Sparse Features in Machine Learning

Dive into Deep Learning

Learning Rate Dilemma for Sparse Features

In natural language processing, a classic example of a sparse feature is the occurrence of a rare word within a text corpus. For instance, the word 'preconditioning' appears much less frequently than a common word like 'learning'. Consequently, the parameter or feature associated with 'preconditioning' is mostly inactive and only receives meaningful updates on the rare occasions it is actually observed. This characteristic of infrequent occurrence is also prevalent in other practical domains, such as computational advertising and personalized collaborative filtering, where specific items are typically of interest to only a small subset of users.
