Learn Before
Attention Score in Transformers (β_{i,j})
The attention score, denoted β_{i,j}, is the intermediate value computed between a query vector and a key vector before any normalization is applied. This score calculation involves a scaled dot product with an optional masking variable, defined by the formula:

β_{i,j} = (q_i · k_j) / √(d_k) + Mask(i, j)

In this equation, d_k represents the dimension of the key vectors, and Mask(i, j) is the masking variable for the pair (i, j), utilized to optionally block certain positions from attending to others: it is 0 for allowed positions and −∞ for blocked ones.
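As a concrete illustration, the minimal NumPy sketch below computes this score for a single query/key pair; the function name attention_score and the mask_value argument are illustrative choices, not part of any particular library.

```python
# Minimal sketch of the attention-score formula above, using NumPy.
# attention_score and mask_value are illustrative names, not library API.
import numpy as np

def attention_score(q, k, mask_value=0.0):
    """Compute beta_{i,j} = (q . k) / sqrt(d_k) + Mask(i, j) for one
    query/key pair. mask_value is 0.0 for visible positions and -inf
    for blocked ones."""
    d_k = k.shape[-1]                       # dimension of the key vector
    return (q @ k) / np.sqrt(d_k) + mask_value
```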

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Analyzing Training Instability in an Attention Mechanism

An engineer is designing a self-attention layer for a text processing model. They notice that as they increase the dimensionality (d_k) of the query and key vectors, the training process becomes unstable, and the gradients used for learning become extremely small. Which of the following best explains this phenomenon and the standard solution implemented within the attention mechanism?
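For intuition about this question: with vector entries of roughly unit variance, the raw dot product of two d_k-dimensional vectors has variance about d_k, so its typical magnitude grows with dimension and pushes the softmax into a saturated, small-gradient regime; dividing by √(d_k) keeps the score variance near 1. A short NumPy sketch of the effect (sample size and seed are arbitrary):

```python
# With i.i.d. N(0, 1) entries, a raw dot product q . k has variance d_k,
# so its standard deviation grows like sqrt(d_k); dividing by sqrt(d_k)
# keeps the score scale roughly constant across dimensions.
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 256, 4096):
    q = rng.standard_normal((1000, d_k))
    k = rng.standard_normal((1000, d_k))
    raw = np.sum(q * k, axis=1)             # unscaled dot products
    scaled = raw / np.sqrt(d_k)             # scaled dot products
    print(d_k, raw.std(), scaled.std())     # raw std ~ sqrt(d_k), scaled ~ 1
```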
A transformer's self-attention layer calculates an output vector for each input token. Arrange the following computational steps in the correct sequence to produce a single output vector, based on its query vector and the full set of key and value vectors for the input sequence.
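As a reference for the ordering this question asks about, here is a minimal NumPy sketch of the full computation; the names, shapes, and optional mask handling are illustrative assumptions:

```python
# Sketch of the step sequence: dot products -> scaling -> optional
# masking -> softmax -> weighted sum of value vectors.
import numpy as np

def attention_output(q, K, V, mask=None):
    """q: (d_k,) query; K: (n, d_k) keys; V: (n, d_v) values;
    mask: optional (n,) vector of 0.0 / -inf values."""
    d_k = K.shape[-1]
    scores = (K @ q) / np.sqrt(d_k)         # steps 1-2: scaled dot products (beta)
    if mask is not None:
        scores = scores + mask              # step 3: optional masking
    weights = np.exp(scores - scores.max()) # step 4: softmax -> weights (alpha)
    weights /= weights.sum()
    return weights @ V                      # step 5: weighted sum of values
```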
Learn After
Calculating Attention Weights (α_{i,j}) in Transformers
Relative Positional Encoding as a Query-Key Bias
Calculating a Scaled Attention Score

In a sequence processing model, an intermediate score is calculated to determine the relationship between two elements. This score is found by taking the dot product of a 'query' vector and a 'key' vector, and then scaling the result by dividing by the square root of the vectors' dimension. Assume no other adjustments are made to the score.
Given the following information:
- Query vector: [2.0, 0.5, 1.0, -1.5]
- Key vector: [1.0, 1.0, -0.5, 2.0]
- Vector dimension: 4

What is the calculated intermediate score?
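One way to verify the arithmetic with a few lines of NumPy:

```python
# The dot product is 2.0*1.0 + 0.5*1.0 + 1.0*(-0.5) + (-1.5)*2.0 = -1.0,
# and dividing by sqrt(4) = 2 gives a score of -0.5.
import numpy as np

q = np.array([2.0, 0.5, 1.0, -1.5])
k = np.array([1.0, 1.0, -0.5, 2.0])
score = (q @ k) / np.sqrt(len(q))
print(score)   # -0.5
```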
In a transformer model designed for text generation, a masking mechanism is applied to the attention scores (β_{i,j}) to prevent a token at position i from attending to future tokens (positions j > i). This is achieved by adding a large negative number (e.g., -∞) to the score before normalization. Consider the calculation of attention scores for a sequence of 4 tokens. Which of the following matrices correctly represents the application of this causal mask, where 'Score' indicates a calculated value and '-∞' indicates a masked value?
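For reference, a small NumPy sketch that builds the 4×4 causal mask pattern this question describes (the use of np.tril here is an illustrative choice):

```python
# Position i may attend to positions j <= i, so entries with j > i
# (above the diagonal) are set to -inf before the softmax.
import numpy as np

n = 4
mask = np.where(np.tril(np.ones((n, n), dtype=bool)), 0.0, -np.inf)
print(mask)
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]
```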
ifrom attending to future tokens (positionsj > i). This is achieved by adding a large negative number (e.g., -∞) to the score before normalization. Consider the calculation of attention scores for a sequence of 4 tokens. Which of the following matrices correctly represents the application of this causal mask, where 'Score' indicates a calculated value and '-∞' indicates a masked value?Analyzing Training Instability in an Attention Mechanism