Learn Before
Calculating a Pre-Softmax Attention Score with Positional Bias
Based on the provided scenario and formula, calculate the final pre-Softmax attention score. Break down your calculation to show how each component (scaled dot product, positional bias, and mask) contributes to the final result.
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A language model computes its pre-normalized attention scores using the formula:
Score = (query_vector ⋅ key_vector + β ⋅ (key_position − query_position)) / sqrt(dimension). In this model, the scalar hyperparameter β is a fixed negative number. Consider a query token at position i = 10. How does the bias term β ⋅ (key_position − query_position) influence the scores for a key token at position j = 12 compared to a key token at position j = 20, assuming all other components of the score are equal for both keys?
Calculating a Pre-Softmax Attention Score with Positional Bias
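A minimal Python sketch of this comparison, using assumed values (β = −0.5, a shared dot product of 4.0, and dimension d = 64 are illustrative choices, not values from the question). With a negative β, the bias grows more negative as j − i grows, so the more distant key at j = 20 is penalized more than the nearer key at j = 12:

```python
import math

beta = -0.5   # fixed negative slope (assumed value)
qk = 4.0      # query·key dot product, equal for both keys (assumed value)
d = 64        # head dimension (assumed value)

def score(i, j):
    # Score = (q·k + β·(j − i)) / sqrt(d), per the formula above
    return (qk + beta * (j - i)) / math.sqrt(d)

i = 10
s12 = score(i, 12)  # bias contribution: -0.5 * 2  = -1.0
s20 = score(i, 20)  # bias contribution: -0.5 * 10 = -5.0
print(s12, s20)     # 0.375 -0.125 — the nearer key scores higher
```

Because all other components are equal, the gap between the two scores comes entirely from the bias term: β ⋅ 2 versus β ⋅ 10.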
In a language model using the complete ALiBi attention formula for causal text generation, the model needs to prevent a query token at position i from attending to any key token at a future position j (where j > i). How does the Mask(i, j) term within the formula α(i, j) = Softmax((q_iᵀk_j + β⋅(j−i))/√d + Mask(i, j)) achieve this?
Modeling Arbitrarily Long Sequences with ALiBi
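A small Python sketch of the masking mechanism, with assumed pre-mask scores (the values in `raw` are illustrative). Setting Mask(i, j) = −∞ for j > i drives exp(−∞) to 0 inside the Softmax, so future positions receive exactly zero attention weight:

```python
import math

def mask(i, j):
    # Causal mask: -inf for future positions (j > i), 0 otherwise
    return float("-inf") if j > i else 0.0

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) evaluates to 0.0
    s = sum(exps)
    return [e / s for e in exps]

# Assumed pre-mask scores for a query at i = 2 over keys j = 0..4
i = 2
raw = [1.0, 2.0, 1.5, 3.0, 2.5]
masked = [s + mask(i, j) for j, s in enumerate(raw)]
weights = softmax(masked)
print(weights)  # positions j = 3 and j = 4 get weight 0
```

Note that j = 3 has the largest raw score, yet its weight is zero: the mask overrides score magnitude entirely for future positions.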
Tuning the ALiBi Bias Scalar (β)