Interpretation of Positional Bias as a Distance Penalty
In self-attention with relative positional embeddings, the bias term PE(i, j) added to the query-key product can intuitively be interpreted as a distance penalty between positions i and j. To reflect that tokens further apart should generally have less influence on each other, the value of PE(i, j) decreases as the token at position j moves further away from the token at position i.
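The idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not the note's exact method: it uses an ALiBi-style linear penalty -m·|i - j| as one concrete choice of decreasing bias, and the slope m and tensor sizes are illustrative assumptions.

```python
import numpy as np

def attention_weights(q, k, m=0.5):
    """q, k: arrays of shape (seq_len, d).
    Adds a distance-penalty bias of -m * |i - j| to each
    query-key score, then softmax-normalizes over the keys."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)                 # standard scaled dot product
    pos = np.arange(seq_len)
    bias = -m * np.abs(pos[:, None] - pos[None, :])  # farther apart -> more negative
    scores = scores + bias
    # softmax over the key axis
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))
w = attention_weights(q, k)
```

With m > 0 the bias shifts probability mass toward nearby tokens before normalization, which is exactly the "distance penalty" reading: distant key positions receive increasingly negative offsets and thus smaller attention weights, all else being equal.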
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Interpretation of Positional Bias as a Distance Penalty
T5 Bias for Relative Positional Embedding
Shared Learnable Bias per Offset
Heuristic-Based Relative Positional Biases
Comparison of Learned vs. Heuristic-Based Relative Positional Biases
Kerple
FIRE
Relative Position Offset Calculation
A self-attention model incorporates positional awareness by adding a bias term directly to the query-key dot product for each pair of positions (i, j). This bias term's value depends on the relative distance between i and j. What is the primary implication of this approach compared to the alternative of adding positional vectors to the input token embeddings?
Incorporating Positional Bias into Attention Scores
In a self-attention mechanism, the score computed between a query at position i and a key at position j is modified by directly adding a bias term whose value depends only on the positions i and j. What is the primary function of this bias term within the attention calculation?
Formula for Causal Attention
In a sequence processing model, the unnormalized attention score between a query at position i and a key at position j is calculated using the formula: Score(i, j) = (q_i ⋅ k_j + PE(i, j)) / √d. What is the primary function of the PE(i, j) term in this calculation?
Analyzing Components of an Attention Score Formula
Diagnosing a Language Model's Performance Issue
Interpretation of Positional Bias as a Distance Penalty
Learn After
In a self-attention mechanism, a bias term is added to the score calculated between a query at position i and a key at position j. Consider a scenario where this bias term is designed to be a large negative value when the distance |i - j| is large, and it approaches zero as the distance gets smaller. How would this specific design influence the model's behavior?
Diagnosing Model Behavior via Positional Bias
Designing a Positional Bias for a Specific Task