Calculating Pre-Normalized Attention Scores
A language model is calculating attention scores to determine the influence of two previous tokens (at positions j=2 and j=4) on the current token being generated (at position i=5). The score before normalization is calculated by adding a query-key similarity value to a relative positional encoding value. Based on the data provided in the case study, which of the two previous tokens will receive a higher attention score? Justify your answer by calculating the pre-normalized score for each position.
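The case-study numbers are not reproduced here, but the calculation itself can be sketched. The snippet below is a minimal illustration, assuming made-up query/key vectors and a hypothetical relative-position bias table `bias` (larger bias for closer tokens); it is not the case-study data.

```python
import numpy as np

# Pre-normalized attention score for current position i attending to an
# earlier position j:
#   score(i, j) = q_i . k_j + b[i - j]
# All vectors and bias values below are invented for illustration.

d = 4                                  # embedding dimension (assumed)
rng = np.random.default_rng(0)
q5 = rng.normal(size=d)                # query vector for current position i = 5
k2 = rng.normal(size=d)                # key vector for position j = 2
k4 = rng.normal(size=d)                # key vector for position j = 4

# Hypothetical relative-position bias, indexed by distance i - j.
bias = {1: 0.5, 2: 0.3, 3: 0.1}

def score(q, k, distance):
    """Pre-normalized score: query-key similarity plus positional bias."""
    return float(q @ k) + bias[distance]

s2 = score(q5, k2, 5 - 2)              # distance 3 -> bias 0.1
s4 = score(q5, k4, 5 - 4)              # distance 1 -> bias 0.5
winner = 2 if s2 > s4 else 4
print(f"score(i=5, j=2) = {s2:.3f}")
print(f"score(i=5, j=4) = {s4:.3f}")
print(f"higher attention goes to position j={winner}")
```

Whichever position has the larger sum of similarity and positional bias receives the higher pre-normalized score; with the actual case-study values, the same two-term addition decides the answer.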
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Consider the calculation of an attention weight, which determines the influence of an input at position j on the output at a later position i. The calculation is based on a formula that includes: 1) a similarity score between vectors from positions i and j, 2) a term that depends on the relative distance between i and j, and 3) a masking component that prevents attending to positions k where k > i. If the term that depends on the relative distance were removed from this calculation, what would be the primary consequence?
Calculating Pre-Normalized Attention Scores
In a causal attention mechanism that incorporates relative positional information, consider the calculation of attention for an output at position i. If the dot product of the query vector from position i with the key vector from position j is identical to its dot product with the key vector from position k (where j ≠ k, and both j, k < i), then the final attention weights assigned to positions j and k will also be identical.
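The setup in this related question can be made concrete with a small numeric sketch. The numbers below are invented for illustration: both earlier positions get the same query-key dot product, but their relative-position biases (a hypothetical `bias` table) differ, so the pre-softmax scores, and hence the normalized weights, differ.

```python
import math

# Hypothetical illustration: identical query-key dot products at two earlier
# positions, but different relative-position biases. All values are made up.
i = 5
positions = {"j": 4, "k": 2}           # two earlier positions, both < i
dot = 2.0                              # identical query-key dot product (assumed)
bias = {1: 0.5, 2: 0.3, 3: 0.1}        # hypothetical relative-position biases

# Pre-normalized scores: same similarity term, different positional term.
scores = {name: dot + bias[i - p] for name, p in positions.items()}
print(scores)                          # prints {'j': 2.5, 'k': 2.1}

# Softmax over the two scores: the weights differ because the biases differ,
# even though the dot products are identical.
z = sum(math.exp(s) for s in scores.values())
weights = {name: math.exp(s) / z for name, s in scores.items()}
print(weights)
```

Under these assumptions the nearer position receives the larger weight, which shows why identical dot products alone do not force identical attention weights once a relative-position term is added.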