Learn Before
Attention Weight with Relative Positional Encoding
The attention weight \alpha_{i,j} in a causal attention mechanism can be calculated by incorporating relative positional information directly into the attention score. The formula is: \alpha_{i,j} = \mathrm{Softmax}\left(\frac{q_i \cdot k_j}{\sqrt{d}} + \mathrm{PE}(i, j) + \mathrm{Mask}(i, j)\right). Here, the score is based on the dot product of the query vector q_i and the key vector k_j, scaled by the square root of the dimension d. A relative positional encoding term, PE(i, j), is added to this score to inject information about the relative distance between positions i and j. The term Mask(i, j) is used to enforce causality by preventing attention to future positions (where j > i), typically by setting those scores to −∞ so that their normalized weights become zero.
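Below is a minimal NumPy sketch of this calculation, assuming the relative term is a scalar bias that depends only on the distance i − j; the function name causal_attention_weights and the toy linear bias are illustrative, not taken from the course material.

```python
import numpy as np

def causal_attention_weights(Q, K, rel_bias):
    """alpha[i, j] = Softmax_j( q_i · k_j / sqrt(d) + PE(i - j) + Mask(i, j) )."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                        # scaled dot-product similarity
    # relative positional term: a scalar bias that depends only on the distance i - j
    rel = np.array([[rel_bias(i - j) for j in range(n)] for i in range(n)])
    # causal mask: future positions (j > i) get -inf, so their softmax weight is 0
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    scores = scores + rel + mask
    # row-wise softmax over the allowed (current and past) positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Example: 4 tokens, d = 8, and a toy bias that slightly favors nearby positions
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
alpha = causal_attention_weights(Q, K, rel_bias=lambda dist: -0.1 * dist)
print(alpha.round(3))  # each row sums to 1; entries above the diagonal are exactly 0
```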

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Attention Weight with Relative Positional Encoding
A language model is designed to generate a sentence one word at a time, from beginning to end. To generate the word at a specific position i, it uses an attention mechanism to weigh the importance of the words that came before it. Which of the following statements correctly analyzes the structural constraint required for this mechanism to function properly for this specific task?
Formula for Attention Weight with Relative Positional Encoding
Analyzing Attention Mechanism Constraints
An autoregressive model is processing the input sequence 'The quick brown fox'. When calculating the output representation for the token 'brown' (the third token), which set of tokens can it attend to if a causal attention mechanism is being used?
Learn After
Consider the calculation of an attention weight, which determines the influence of an input at position j on the output at a later position i. The calculation is based on a formula that includes: 1) a similarity score between vectors from positions i and j, 2) a term that depends on the relative distance between i and j, and 3) a masking component that prevents attending to positions k where k > i. If the term that depends on the relative distance were removed from this calculation, what would be the primary consequence?
Calculating Pre-Normalized Attention Scores
In a causal attention mechanism that incorporates relative positional information, consider the calculation of attention for an output at position i. If the dot product of the query vector from position i with the key vector from position j is identical to its dot product with the key vector from position k (where j ≠ k, and both j, k < i), then the final attention weights assigned to positions j and k will also be identical.