Learn Before
Calculating Attention Weights (αi,j) in Transformers
The attention weight, denoted as αi,j, quantifies the relevance of position j to position i. In Transformer models, this weight is derived by applying a normalization function (Softmax) to the attention score βi,j. The attention score itself is the rescaled dot product of the query vector qi and the key vector kj, potentially including a mask:

αi,j = Softmax(βi,j),  where  βi,j = (qi · kj) / √d + Mask(i, j)
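As a minimal sketch of this computation (assuming NumPy, a single attention head, and toy row-stacked query/key matrices; the function and variable names are illustrative, not from the course):

```python
import numpy as np

def attention_weights(Q, K, mask=None):
    """Compute attention weights alpha[i, j] from query/key matrices.

    Q, K: arrays of shape (seq_len, d), one query/key vector per row.
    mask: optional (seq_len, seq_len) additive mask (e.g., -inf above the diagonal).
    """
    d = Q.shape[-1]
    beta = Q @ K.T / np.sqrt(d)        # rescaled dot product: beta[i, j] = qi . kj / sqrt(d)
    if mask is not None:
        beta = beta + mask             # optional masking before normalization
    beta = beta - beta.max(axis=-1, keepdims=True)  # shift for numerical stability
    exp_beta = np.exp(beta)
    return exp_beta / exp_beta.sum(axis=-1, keepdims=True)  # Softmax over j
```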

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Relative Positional Encoding as a Query-Key Bias
In a sequence processing model, an intermediate score is calculated to determine the relationship between two elements. This score is found by taking the dot product of a 'query' vector and a 'key' vector, and then scaling the result by dividing by the square root of the vectors' dimension. Assume no other adjustments are made to the score.
Given the following information:
- Query vector: [2.0, 0.5, 1.0, -1.5]
- Key vector: [1.0, 1.0, -0.5, 2.0]
- Vector dimension: 4
What is the calculated intermediate score?
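One way to check the arithmetic (a short NumPy sketch using the values given above):

```python
import numpy as np

q = np.array([2.0, 0.5, 1.0, -1.5])   # query vector
k = np.array([1.0, 1.0, -0.5, 2.0])   # key vector
d = 4                                  # vector dimension

score = q @ k / np.sqrt(d)
# Dot product: 2.0 + 0.5 - 0.5 - 3.0 = -1.0; dividing by sqrt(4) = 2 gives -0.5
print(score)  # -0.5
```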
In a transformer model designed for text generation, a masking mechanism is applied to the attention scores (βi,j) to prevent a token at position i from attending to future tokens (positions j > i). This is achieved by adding a large negative number (e.g., -∞) to the score before normalization. Consider the calculation of attention scores for a sequence of 4 tokens. Which of the following matrices correctly represents the application of this causal mask, where 'Score' indicates a calculated value and '-∞' indicates a masked value?
Analyzing Training Instability in an Attention Mechanism
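For the causal-mask question above, a minimal NumPy sketch of the 4-token additive mask (the -∞ convention follows the question's description):

```python
import numpy as np

n = 4
# Additive causal mask: 0 where j <= i (attendable), -inf where j > i (future)
mask = np.where(np.arange(n)[None, :] > np.arange(n)[:, None], -np.inf, 0.0)
print(mask)
# Row i keeps 'Score' entries for positions j <= i; entries with j > i are -inf,
# so Softmax assigns them zero weight after normalization.
```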
Learn After
Scaled Dot-Product Attention
Causal Self-Attention in Autoregressive Decoders
A model is processing a sequence of three tokens. For the query at position 2, the un-normalized attention scores with respect to the keys at positions 0, 1, and 2 are calculated as [1.0, 2.0, 3.0] respectively. What is the final attention weight that the token at position 2 will assign to the token at position 1?
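A quick numerical check of this question (a sketch; the Softmax here follows the weight formula above):

```python
import numpy as np

scores = np.array([1.0, 2.0, 3.0])               # un-normalized scores vs. positions 0, 1, 2
weights = np.exp(scores) / np.exp(scores).sum()  # Softmax normalization
print(weights[1])  # weight on position 1: e^2 / (e^1 + e^2 + e^3) ≈ 0.2447
```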
Attention Output as a Weighted Sum of Values
Impact of Masking on Attention Weight Distribution
True or False: In a self-attention mechanism, if you add the same constant value to all un-normalized attention scores corresponding to a single query vector, the final normalized attention weights for that query will change.
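The claim can be tested numerically (a minimal sketch; it probes whether Softmax is affected by adding a constant to all of a query's scores):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

scores = np.array([1.0, 2.0, 3.0])
shifted = scores + 5.0  # same constant added to every un-normalized score

# Prints True: the normalized attention weights are unchanged by the shift
print(np.allclose(softmax(scores), softmax(shifted)))
```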
Attention Weight Formula (αi,j)