Learn Before
In a language model using the complete ALiBi attention formula for causal text generation, the model needs to prevent a query token at position i from attending to any key token at a future position j (where j > i). How does the Mask(i, j) term within the formula α(i, j) = Softmax((q_iᵀk_j + β⋅(j-i))/√d + Mask(i, j)) achieve this?
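A minimal sketch of the mechanism this question asks about, assuming a small NumPy setup: Mask(i, j) is 0 when j ≤ i and -∞ when j > i, so after the softmax every future key receives exactly zero attention weight. The shapes and the value of beta below are illustrative assumptions, not a reference ALiBi implementation.

# Sketch of causal masking in the card's formula
#   alpha(i, j) = Softmax((q_i^T k_j + beta * (j - i)) / sqrt(d) + Mask(i, j))
import numpy as np

def alibi_causal_attention(Q, K, beta=-0.5):
    """Q, K: (seq_len, d) arrays of query and key vectors (illustrative)."""
    seq_len, d = Q.shape
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    scores = (Q @ K.T + beta * (j - i)) / np.sqrt(d)
    # Mask(i, j) = 0 for j <= i, -inf for j > i: exp(-inf) = 0,
    # so future keys get zero weight after the softmax.
    scores = scores + np.where(j > i, -np.inf, 0.0)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
A = alibi_causal_attention(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
print(np.round(A, 3))   # upper triangle (j > i) is exactly 0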
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A language model computes its pre-normalized attention scores using the formula:
Score = (query_vector ⋅ key_vector + β ⋅ (key_position - query_position)) / sqrt(dimension)
In this model, the scalar hyperparameter β is a fixed negative number. Consider a query token at position i = 10. How does the bias term β ⋅ (key_position - query_position) influence the scores for a key token at position j = 12 compared to a key token at position j = 20, assuming all other components of the score are equal for both keys? (A worked numeric sketch follows at the end of this Related list.)
Calculating a Pre-Softmax Attention Score with Positional Bias
In a language model using the complete ALiBi attention formula for causal text generation, the model needs to prevent a query token at position i from attending to any key token at a future position j (where j > i). How does the Mask(i, j) term within the formula α(i, j) = Softmax((q_iᵀk_j + β⋅(j-i))/√d + Mask(i, j)) achieve this?
Modeling Arbitrarily Long Sequences with ALiBi
Tuning the ALiBi Bias Scalar (β)
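A minimal worked sketch of the bias comparison in the related question above ("Calculating a Pre-Softmax Attention Score with Positional Bias"). The value of β, the dimension d, and the dot product below are illustrative assumptions; only the positions i = 10, j = 12, and j = 20 come from the question.

import math

beta = -0.5   # fixed negative bias scalar (hypothetical value)
d = 64        # query/key dimension (hypothetical value)
i = 10        # query position from the question
qk = 3.0      # assume identical q.k dot product for both keys

for j in (12, 20):
    bias = beta * (j - i)                # -1.0 for j=12, -5.0 for j=20
    score = (qk + bias) / math.sqrt(d)
    print(f"j={j}: bias={bias}, score={score:.4f}")
# Because beta < 0, the more distant key (j=20) receives the larger negative
# bias and therefore the lower pre-softmax score, i.e. less attention.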