Learn Before
Visual Example of a Linear Relative Position Bias in Causal Attention
In causal self-attention, a linear relative position bias is applied to penalize attention to distant past tokens. The bias for a query at position i and a key at position j is calculated as -β ⋅ (i - j), where β is a positive scalar parameter. This bias is only applied to valid query-key pairs where j ≤ i, enforcing causality. For example, the set of computed query-key dot products for a sequence of length 7 (indexed 0-6) would form a lower-triangular structure: q0k0ᵀ; q1k0ᵀ, q1k1ᵀ; ...; q6k0ᵀ, ..., q6k6ᵀ. The bias added to each of these dot products would be zero for self-attention (e.g., q2k2ᵀ) and become increasingly negative for more distant pairs (e.g., the bias for q6k0ᵀ would be more negative than for q6k5ᵀ).
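Below is a minimal NumPy sketch of this construction (the slope value β = 0.1, the variable names, and the use of -inf masking are illustrative assumptions, not taken from any particular library): it builds the 7×7 bias matrix for the example above and reproduces the lower-triangular pattern.

```python
import numpy as np

# Minimal sketch of an ALiBi-style linear relative position bias.
# Assumed values: sequence length 7 (positions 0-6), slope beta = 0.1.
seq_len = 7
beta = 0.1

i = np.arange(seq_len)[:, None]  # query positions (rows)
j = np.arange(seq_len)[None, :]  # key positions (columns)

# bias[i, j] = -beta * (i - j) for j <= i; future positions (j > i)
# get -inf so the softmax assigns them zero attention weight.
bias = np.where(j <= i, -beta * (i - j), -np.inf)

# The rounded matrix shows zeros on the diagonal (e.g., q2k2ᵀ) and
# increasingly negative values toward the lower left: q6k0ᵀ gets -0.6,
# which is more negative than the -0.1 applied to q6k5ᵀ.
print(np.round(bias, 2))
```

In use, this matrix would simply be added to the query-key dot products before the softmax; the only parameter involved is the single scalar β.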
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Formula for Attention Score with ALiBi Bias
Linear Relative Position Bias Example
In a sequence processing model, a positional bias is calculated to penalize attention scores based on the distance between tokens. The formula used is Bias = -β ⋅ (i - j), where i is the query position, j is the key position, and β is a fixed scalar. If the query token is at position 5, the key token is at position 2, and β = 0.1, what is the calculated bias value?

Visual Example of a Linear Relative Position Bias in Causal Attention
True or False: According to the positional bias formula PE(i, j) = -β ⋅ (i - j), where i is the query position, j is the key position, and β is a positive scalar, the penalty applied to the attention score decreases as the distance between the query and key tokens increases.

Interpreting a Linear Positional Bias Value
Similarity of ALiBi Positional Biases to Length Features
Learn After
In a causal self-attention mechanism, a linear relative position bias is added to the attention scores. The bias for a query at position i attending to a key at position j is calculated as B = -β ⋅ (i - j) for j ≤ i, where β is a positive scalar. How would the attention behavior of a model using a large positive β value (e.g., β = 1.0) compare to a model using a small positive β value (e.g., β = 0.1)?

Calculating Linear Relative Position Bias
In a causal self-attention mechanism, a linear penalty is added to the query-key dot products based on their relative distance. The penalty for a query at position i and a key at position j is calculated as -β ⋅ (i - j), where j ≤ i and β is a positive constant. For a query at position 4 (i = 4), which of the following lists correctly represents the penalties applied to the keys at positions 0 through 4 (j = 0, 1, 2, 3, 4), respectively?