Learn Before
Relative Positional Encoding as a Query-Key Bias
Rather than modifying the initial input token embeddings, an alternative self-attention architecture integrates positional awareness directly into the core interaction calculation. It achieves this by adding a relative positional bias term, represented as u(i, j), directly to the scaled query-key product, which structurally alters the attention score between position i and position j.
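A minimal NumPy sketch of this idea, assuming the bias is supplied as a precomputed matrix u whose entry u[i, j] depends on the pair of positions (the function and variable names here are illustrative, not taken from the course):

```python
import numpy as np

def attention_weights_with_bias(Q, K, u):
    """Scaled query-key scores plus a relative positional bias u[i, j].

    Q: (n, d) query vectors; K: (n, d) key vectors;
    u: (n, n) bias matrix whose entry u[i, j] depends on positions i and j.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # standard scaled dot-product term
    scores = scores + u             # positional bias added to every (i, j) pair
    # softmax over j turns the biased scores into attention weights alpha[i, j]
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)
```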
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Calculating Attention Weights (αi,j) in Transformers
Relative Positional Encoding as a Query-Key Bias
In a sequence processing model, an intermediate score is calculated to determine the relationship between two elements. This score is found by taking the dot product of a 'query' vector and a 'key' vector, and then scaling the result by dividing by the square root of the vectors' dimension. Assume no other adjustments are made to the score.
Given the following information:
- Query vector: [2.0, 0.5, 1.0, -1.5]
- Key vector: [1.0, 1.0, -0.5, 2.0]
- Vector dimension: 4
What is the calculated intermediate score?
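As a check on the arithmetic, a short sketch that computes the scaled dot-product score from the vectors in the question (this code is not part of the original question):

```python
import numpy as np

q = np.array([2.0, 0.5, 1.0, -1.5])   # query vector from the question
k = np.array([1.0, 1.0, -0.5, 2.0])   # key vector from the question
d = 4                                   # vector dimension

score = q @ k / np.sqrt(d)  # (2.0 + 0.5 - 0.5 - 3.0) / 2
print(score)                # -0.5
```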
In a transformer model designed for text generation, a masking mechanism is applied to the attention scores (βi,j) to prevent a token at position i from attending to future tokens (positions j > i). This is achieved by adding a large negative number (e.g., -∞) to the score before normalization. Consider the calculation of attention scores for a sequence of 4 tokens. Which of the following matrices correctly represents the application of this causal mask, where 'Score' indicates a calculated value and '-∞' indicates a masked value? (A sketch of this mask appears at the end of this Related list.)

Analyzing Training Instability in an Attention Mechanism
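For the causal-masking question above, a minimal sketch of how such a mask is typically applied to a 4-token score matrix, assuming entries with j > i are set to -∞ before the softmax (names here are illustrative):

```python
import numpy as np

n = 4
scores = np.random.randn(n, n)          # raw attention scores beta[i, j]

# Causal mask: positions j > i (strictly above the diagonal) are set to -inf
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
masked_scores = np.where(mask, -np.inf, scores)

# After softmax, the masked positions receive zero attention weight
weights = np.exp(masked_scores - masked_scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
```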
Learn After
Interpretation of Positional Bias as a Distance Penalty
T5 Bias for Relative Positional Embedding
Shared Learnable Bias per Offset
Heuristic-Based Relative Positional Biases
Comparison of Learned vs. Heuristic-Based Relative Positional Biases
Kerple
FIRE
Relative Position Offset Calculation
A self-attention model incorporates positional awareness by adding a bias term directly to the query-key dot product for each pair of positions (i, j). This bias term's value depends on the relative distance between i and j. What is the primary implication of this approach compared to the alternative of adding positional vectors to the input token embeddings?

Incorporating Positional Bias into Attention Scores
In a self-attention mechanism, the score computed between a query at position i and a key at position j is modified by directly adding a bias term whose value depends only on the positions i and j. What is the primary function of this bias term within the attention calculation?
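One common way such a bias can be parameterized is with a shared value per clipped relative offset j - i, in the spirit of the offset-based and T5-style biases listed under Learn After. The sketch below assumes a hypothetical per-offset table; it is an illustration, not the course's implementation:

```python
import numpy as np

def relative_bias_matrix(n, bias_per_offset, max_offset):
    """Build u[i, j] from a table of per-offset biases.

    bias_per_offset: array of length 2*max_offset + 1, one value per
    relative offset o = j - i, with offsets clipped to [-max_offset, max_offset].
    """
    offsets = np.arange(n)[None, :] - np.arange(n)[:, None]   # offset j - i
    offsets = np.clip(offsets, -max_offset, max_offset)
    return bias_per_offset[offsets + max_offset]

# Example: 5 tokens, offsets clipped to +/-2, with a hypothetical learned table
table = np.array([-0.4, -0.1, 0.0, -0.1, -0.4])
u = relative_bias_matrix(5, table, max_offset=2)
```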