Analyzing Components of an Attention Score Formula
Consider the following formula for calculating an unnormalized attention score between a query at position i and a key at position j in a sequence: Score(i, j) = (q_i ⋅ k_j + PE(i, j)) / √d + Mask(i, j). Explain the distinct contribution of the PE(i, j) term and the Mask(i, j) term to the final attention weight. How would the model's behavior likely change if each of these terms were individually removed?
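A minimal NumPy sketch of this score, for reference while answering. The formula fixes neither PE(i, j) nor Mask(i, j), so this sketch assumes an ALiBi-style linear distance penalty for PE(i, j) and a standard causal mask for Mask(i, j); attention_scores and slope are illustrative names, not from the card.

import numpy as np

def attention_scores(Q, K, slope=0.1):
    # Score(i, j) = (q_i . k_j + PE(i, j)) / sqrt(d) + Mask(i, j)
    n, d = Q.shape
    i = np.arange(n)[:, None]             # query positions i, shape (n, 1)
    j = np.arange(n)[None, :]             # key positions j, shape (1, n)
    pe = -slope * np.abs(i - j)           # assumed PE(i, j): linear distance penalty
    mask = np.where(j > i, -np.inf, 0.0)  # assumed Mask(i, j): causal, blocks future keys
    return (Q @ K.T + pe) / np.sqrt(d) + mask

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
S = attention_scores(Q, K)
W = np.exp(S - S.max(axis=-1, keepdims=True))
W /= W.sum(axis=-1, keepdims=True)        # softmax; masked entries become exactly 0
print(np.round(W, 3))                     # upper triangle is 0: no attention to the future

Setting slope=0 removes the positional signal, while deleting the mask term lets each query attend to later positions; comparing those two ablations mirrors the removal thought experiment the question asks about.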
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Formula for Causal Attention
In a sequence processing model, the unnormalized attention score between a query at position i and a key at position j is calculated using the formula: Score(i, j) = (q_i ⋅ k_j + PE(i, j)) / √d. What is the primary function of the PE(i, j) term in this calculation?
Diagnosing a Language Model's Performance Issue
Interpretation of Positional Bias as a Distance Penalty