In a causal attention mechanism that incorporates relative positional information, consider the calculation of attention for an output at position i. If the dot product of the query vector from position i with the key vector from position j is identical to its dot product with the key vector from position k (where j ≠ k, and both j, k < i), then the final attention weights assigned to positions j and k will also be identical.
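A minimal NumPy sketch of this setup (assuming, purely for illustration, an ALiBi-style linear distance bias with a hypothetical slope; the card does not specify the form of the relative-position term): because the bias added for position j depends on the distance i − j, two positions with identical query-key dot products generally receive different pre-softmax scores, and therefore different attention weights.

```python
# Sketch (not any specific library's API): causal attention with an
# ALiBi-style relative-distance bias. Equal dot products at different
# distances still yield different attention weights.
import numpy as np

def causal_attention_weights(scores, slope=0.5):
    """Softmax over causally masked scores plus a linear distance bias.

    scores[i, j] holds the raw dot product q_i . k_j; `slope` is an
    assumed bias strength chosen for illustration.
    """
    n = scores.shape[0]
    i_idx = np.arange(n)[:, None]
    j_idx = np.arange(n)[None, :]
    bias = -slope * (i_idx - j_idx)                 # relative-distance term b(i - j)
    masked = np.where(j_idx <= i_idx, scores + bias, -np.inf)  # causal mask: no j > i
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

# Positions j = 0 and k = 1 have identical dot products with the query at i = 2 ...
scores = np.array([[1.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.7, 0.7, 1.0]])
w = causal_attention_weights(scores)
print(w[2, 0], w[2, 1])   # ... yet their final weights differ, because the
                          # distance bias differs (i - j = 2 vs. i - k = 1).
```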
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Calculating Pre-Normalized Attention Scores
Consider the calculation of an attention weight, which determines the influence of an input at position j on the output at a later position i. The calculation is based on a formula that includes: 1) a similarity score between vectors from positions i and j, 2) a term that depends on the relative distance between i and j, and 3) a masking component that prevents attending to positions k where k > i. If the term that depends on the relative distance were removed from this calculation, what would be the primary consequence?
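A small sketch of the scenario this related question describes (same assumptions as above: NumPy, a single head, no scaling): with the relative-distance term removed, the score for each visible position depends only on query-key similarity, so reordering the keys merely reorders the weights. The mechanism then carries no notion of word order beyond the causal mask.

```python
# Sketch: attention scores with the relative-distance term removed depend
# only on content, so shuffling the visible keys just shuffles the weights.
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=4)            # query at position i = 3
keys = rng.normal(size=(3, 4))    # keys at visible positions j = 0, 1, 2

def weights_without_distance_term(q, keys):
    s = keys @ q                                  # similarity scores only
    e = np.exp(s - s.max())
    return e / e.sum()

w = weights_without_distance_term(q, keys)
w_shuffled = weights_without_distance_term(q, keys[[2, 0, 1]])
# The weight attached to each key vector is unchanged by reordering:
print(np.allclose(w[[2, 0, 1]], w_shuffled))      # True
```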