Positional embedding methods differ in how they handle sequences longer than those seen during training. In a visual representation of these methods across a range of positions, positions observed during training (e.g., blue points) can be distinguished from positions first encountered at test time (e.g., red points). An encoding model that strictly memorizes the positions seen during training cannot generalize to positions outside that range. Models designed to generalize, by contrast, can process newly observed positions through mechanisms such as extrapolation and interpolation.

To overcome the limitations of fixed-length training, an alternative approach is to develop generalizable positional embeddings. Suppose an embedding model is trained on sequences with a maximum length of $$m_l$$. If the model can generalize, it can be applied to handle much longer sequences of length $$m$$ (where $$m \gg m_l$$) during inference. This capability allows the model to extrapolate and effectively deal with new positions outside the range observed in the training data.
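The contrast between memorizing positions and computing them can be sketched in a few lines. This is a minimal illustration, not the course's implementation: it assumes a sinusoidal encoding as the example of a function defined for any position, and the names `sinusoidal_embedding`, `learned_table`, and `lookup_embedding`, along with the values of `m_l`, `m`, and `d_model`, are hypothetical. Being able to compute an embedding at a new position does not by itself guarantee that a trained model behaves well there.

```python
import math

def sinusoidal_embedding(pos, d_model=8):
    """Sinusoidal embedding for one position; defined for any pos >= 0."""
    emb = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        emb.append(math.sin(angle))  # even dimension
        emb.append(math.cos(angle))  # odd dimension
    return emb

m_l = 4  # hypothetical maximum training length

# A "memorizing" model: one stored vector per position seen in training.
learned_table = {pos: sinusoidal_embedding(pos) for pos in range(m_l)}

def lookup_embedding(pos):
    """Table lookup; fails for positions outside the training range."""
    if pos not in learned_table:
        raise KeyError(f"position {pos} was never seen during training")
    return learned_table[pos]

m = 64           # hypothetical inference length, m >> m_l
new_pos = m - 1  # a position never observed during training

vec = sinusoidal_embedding(new_pos)  # the function still yields a vector
try:
    lookup_embedding(new_pos)
    table_ok = True
except KeyError:
    table_ok = False  # the table has no entry for this position
```

The lookup table stands in for the strictly memorizing model: it has nothing to return once the position index reaches `m_l`, whereas the function-based encoding can be evaluated at any index.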

Generalizable Positional Embeddings

Reference of Foundations of Large Language Models Course

Visualizing Positional Embedding Generalization

Various techniques exist for generalizing positional embedding models to sequences longer than those encountered during training. These approaches typically fall into two classes: extrapolation and interpolation.
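Extrapolation and interpolation, the two mechanisms mentioned earlier, differ in which position values the model is asked to handle. The sketch below is an illustration under stated assumptions: it uses linear rescaling of position indices as one simple form of interpolation, and the function names and the values of `m_l` and `m` are hypothetical rather than taken from the source.

```python
def extrapolate_positions(m):
    """Use raw indices 0..m-1; indices >= m_l are new to the model."""
    return [float(p) for p in range(m)]

def interpolate_positions(m, m_l):
    """Linearly rescale indices so that all of them fall inside [0, m_l)."""
    scale = min(1.0, m_l / m)  # leave short sequences unchanged
    return [p * scale for p in range(m)]

m_l, m = 4, 8  # hypothetical training and inference lengths

extra = extrapolate_positions(m)       # reaches positions beyond m_l
inter = interpolate_positions(m, m_l)  # stays inside the trained range
```

With extrapolation the model must cope with position values larger than any it was trained on. With interpolation every value stays inside the trained range, but neighbouring tokens end up closer together than they were during training.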
