Causal Attention
Causal attention is a type of self-attention mechanism where a query at a specific position i can only attend to keys and values at positions less than or equal to i (K_<=i, V_<=i). This restriction, often implemented using a mask, ensures that the model's prediction for a token only depends on the preceding tokens and not on future ones. The computation is formally expressed as Att_qkv(q_i, K_<=i, V_<=i).
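The masking described above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation (not taken from the course material): scores above the diagonal are set to -inf so that, after the Softmax, each query position i places zero weight on positions j > i.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Causal self-attention: query at position i attends only to positions <= i."""
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # scaled dot-product scores
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)           # block future positions
    # numerically stable softmax over each row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Because the first row of the mask blocks every position except 0, the output for the first token is exactly V[0], matching the definition Att_qkv(q_i, K_<=i, V_<=i) for i = 0.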
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Causal Attention
In an attention mechanism, the scores for a query vector q are calculated by taking its dot product with a set of key vectors K. These scores are then scaled by a factor related to the vector dimension before being passed to a Softmax function to produce weights. A developer implements this but omits the scaling step, using the formula Softmax(q * K^T) * V. What is the most likely adverse effect of this omission, especially when the dimension of the key vectors is large?
Calculating Pre-Softmax Attention Scores
Applying Scaled Dot-Product Attention
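The effect that the scaling question asks about can be observed numerically. In this illustrative sketch (assumptions: random Gaussian queries and keys, dimension 512), the unscaled dot products have variance proportional to the dimension, so the Softmax saturates toward a near one-hot distribution, while the scores divided by sqrt(d) stay in a moderate range.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                               # large key/query dimension
q = rng.standard_normal(d)
K = rng.standard_normal((8, d))

raw = q @ K.T                         # variance grows with d
scaled = raw / np.sqrt(d)             # variance stays ~1 regardless of d

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# The unscaled scores concentrate almost all probability mass on one key,
# which in training leads to vanishing gradients through the Softmax.
print("unscaled max weight:", softmax(raw).max())
print("scaled max weight:  ", softmax(scaled).max())
```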
Learn After
In a generative language model, an attention mechanism processes a sequence of 4 tokens. To ensure that the prediction for each token only depends on the preceding tokens and itself, a mask is applied to the raw attention score matrix before the final weighting step. Given the initial score matrix below, where rows represent the 'query' token and columns represent the 'key' token, which of the following matrices correctly shows the result of applying this causal mask? (Note: '-inf' represents a very large negative number that effectively nullifies the score.)
Initial Matrix: [[ 0.8, 1.2, 0.5, 2.1 ], [ 1.5, 0.6, 1.9, 0.3 ], [ 0.9, 2.2, 1.1, 0.7 ], [ 1.3, 0.4, 1.6, 0.2 ]]
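The masking step in the question above can be reproduced directly. In this small sketch, a strictly upper-triangular boolean mask marks the 'future' entries (column index greater than row index) of the given score matrix, and those entries are replaced with -inf.

```python
import numpy as np

scores = np.array([[0.8, 1.2, 0.5, 2.1],
                   [1.5, 0.6, 1.9, 0.3],
                   [0.9, 2.2, 1.1, 0.7],
                   [1.3, 0.4, 1.6, 0.2]])

# True strictly above the diagonal, i.e. where the key position is in the future
mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
masked = np.where(mask, float("-inf"), scores)
print(masked)
# Row i keeps columns 0..i; entries with j > i become -inf.
```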
Consequences of Misconfigured Attention in Generative Models
Appropriate Application of an Attention Mechanism