Learn Before
Linear Relative Position Bias Example
A linear relative position bias scheme incorporates sequence order into attention mechanisms by adding a penalty term, calculated as -β ⋅ (i - j), to the query-key dot product. In this formula, (i - j) is the relative distance between the query at position i and the key at position j, and β is a positive scalar, resulting in a penalty that grows linearly with distance. In a causal attention setting, where a query only attends to previous keys, the bias values for different maximum relative distances are as follows:
- For relative distances of 3, 2, 1, and 0, the biases are: -3β, -2β, -β, 0
- For relative distances of 4, 3, 2, 1, and 0, the biases are: -4β, -3β, -2β, -β, 0
- For relative distances of 5, 4, 3, 2, 1, and 0, the biases are: -5β, -4β, -3β, -2β, -β, 0
- For relative distances of 6, 5, 4, 3, 2, 1, and 0, the biases are: -6β, -5β, -4β, -3β, -2β, -β, 0
This pattern shows that the bias is zero for self-attention (when i = j) and grows into an increasingly negative penalty for positions that are further apart.
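To make the pattern concrete, here is a minimal NumPy sketch (not from the source) that builds the causal bias matrix from the formula above; the function name linear_position_bias and the values seq_len = 4 and beta = 0.1 are illustrative assumptions.

```python
import numpy as np

def linear_position_bias(seq_len: int, beta: float) -> np.ndarray:
    """Causal linear relative position bias: -beta * (i - j) for keys j <= query i."""
    i = np.arange(seq_len)[:, None]  # query positions (rows)
    j = np.arange(seq_len)[None, :]  # key positions (columns)
    bias = -beta * (i - j)           # grows linearly with distance, 0 on the diagonal
    bias[j > i] = -np.inf            # causal mask: no attention to future keys
    return bias

# Hypothetical values: seq_len = 4, beta = 0.1.
# The last row is [-0.3, -0.2, -0.1, 0.0], matching the
# "relative distances of 3, 2, 1, and 0" case above.
print(linear_position_bias(4, 0.1))
```

In practice this matrix would be added to the query-key dot-product scores before the softmax, so distant keys receive lower attention weights.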
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Formula for Attention Score with ALiBi Bias
Linear Relative Position Bias Example
In a sequence processing model, a positional bias is calculated to penalize attention scores based on the distance between tokens. The formula used is Bias = -β ⋅ (i - j), where i is the query position, j is the key position, and β is a fixed scalar. If the query token is at position 5, the key token is at position 2, and β = 0.1, what is the calculated bias value?
Visual Example of a Linear Relative Position Bias in Causal Attention
True or False: According to the positional bias formula PE(i, j) = -β ⋅ (i - j), where i is the query position, j is the key position, and β is a positive scalar, the penalty applied to the attention score decreases as the distance between the query and key tokens increases.
Interpreting a Linear Positional Bias Value
Similarity of ALiBi Positional Biases to Length Features
Learn After
In a text-processing model, a bias term is added to the attention scores between a 'query' word and a 'key' word before the final attention weights are computed. This bias is calculated as -β ⋅ d, where d is the distance (number of words) between the query and the key, and β is a fixed positive number. What is the primary effect of this biasing scheme on the model's behavior?
Calculating Linear Relative Position Biases
An attention mechanism uses a linear relative position bias to penalize distant key-value pairs. In a causal setting, a query at a given position attends to itself and all previous positions up to a certain maximum distance. Match each maximum relative distance to the corresponding set of bias values that would be applied, where β is a scalar.