Consider the calculation of an attention weight, which determines the influence of an input at position j on the output at a later position i. The calculation is based on a formula that includes: 1) a similarity score between vectors from positions i and j, 2) a term that depends on the relative distance between i and j, and 3) a masking component that prevents attending to positions k where k > i. If the term that depends on the relative distance were removed from this calculation, what would be the primary consequence?
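The three components named in the question can be sketched together. This is a minimal illustration, not any particular model's implementation: the relative-distance term is shown as a hypothetical additive bias matrix (in the style of learned relative-position biases), and all names are illustrative.

```python
import numpy as np

def attention_weights(Q, K, rel_bias):
    """Causal attention weights combining the three terms from the question.

    Q, K: (n, d) query/key matrices.
    rel_bias: (n, n) matrix whose entry [i, j] is a scalar depending only on
    the distance i - j (hypothetical values; learned in practice).
    """
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                 # 1) similarity score
    scores = scores + rel_bias                    # 2) relative-distance term
    causal = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(causal, -np.inf, scores)    # 3) mask positions k > i
    # row-wise softmax turns scores into attention weights
    scores = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)
```

Deleting the `rel_bias` line leaves the weights invariant to where tokens sit relative to one another: the model can still tell *which* earlier tokens are similar, but not *how far away* they are.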
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Calculating Pre-Normalized Attention Scores
In a causal attention mechanism that incorporates relative positional information, consider the calculation of attention for an output at position i. If the dot product of the query vector from position i with the key vector from position j is identical to its dot product with the key vector from position k (where j ≠ k, and both j, k < i), then the final attention weights assigned to positions j and k will also be identical.
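A small numeric check makes the claim in this related card concrete. The scores and distance biases below are made-up values for illustration: two earlier positions receive the same dot product, yet a nonzero relative-distance term separates their final softmax weights.

```python
import numpy as np

# Query at position i = 2 attends to positions j = 0, 1, 2.
sim = np.array([2.0, 2.0, 1.0])          # q_i . k_j: identical for j=0 and j=1
dist_bias = np.array([-1.0, -0.5, 0.0])  # hypothetical bias favoring nearer j

scores = sim + dist_bias
w = np.exp(scores - scores.max())
w = w / w.sum()
# w[0] != w[1]: equal similarities, different final weights
```

So the statement is false whenever the relative-distance term differs for j and k, which it does for any nontrivial bias when j ≠ k.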