Learn Before
Complete ALiBi Attention Formula
The final attention weight in the ALiBi framework, denoted as α(i, j), is computed by applying the Softmax function to the attention score. This score is derived by adding the ALiBi positional bias term, β⋅(j−i), to the standard query-key product q_iᵀk_j, scaling the sum by the inverse square root of the dimension d, and incorporating an optional mask. The complete equation is expressed as:
α(i, j) = Softmax((q_iᵀk_j + β⋅(j−i))/√d + Mask(i, j))
In this formula, q_i and k_j denote the query and key vectors, and 1/√d acts as a scaling factor. The Mask(i, j) term ensures proper attention masking when required.
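To make the computation concrete, the following is a minimal NumPy sketch of the formula above; the function name, sequence length, head dimension, and the slope value β = −0.5 are illustrative assumptions rather than values taken from the text. Note that, following the equation as written here, the positional bias is added to the dot product before the 1/√d scaling.

```python
# Minimal sketch of the complete ALiBi attention formula described above:
# alpha(i, j) = Softmax((q_i^T k_j + beta * (j - i)) / sqrt(d) + Mask(i, j)).
# Sizes and the slope value beta are illustrative assumptions.

import numpy as np

def alibi_attention_weights(Q, K, beta, causal=True):
    """Compute ALiBi attention weights for one attention head.

    Q, K  : arrays of shape (seq_len, d) holding query and key vectors.
    beta  : scalar slope applied to the distance (j - i), typically negative.
    causal: if True, add -inf for positions j > i so they receive zero weight.
    """
    seq_len, d = Q.shape

    # Standard query-key dot products: q_i^T k_j for every (i, j) pair.
    scores = Q @ K.T

    # ALiBi linear positional bias: beta * (j - i).
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    bias = beta * (j - i)

    # Add the bias, then scale by 1 / sqrt(d), as in the formula above.
    scores = (scores + bias) / np.sqrt(d)

    # Optional causal mask: -inf for future positions (j > i).
    if causal:
        scores = np.where(j > i, -np.inf, scores)

    # Row-wise Softmax turns the scores into attention weights alpha(i, j).
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Tiny usage example with made-up sizes.
rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 8))   # 6 tokens, head dimension 8
K = rng.normal(size=(6, 8))
alpha = alibi_attention_weights(Q, K, beta=-0.5)
print(alpha.round(3))         # each row sums to 1; future positions get weight 0
```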

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Complete ALiBi Attention Formula
Calculating a Pre-Softmax Attention Score with Linear Bias
In a model that adds a linear positional bias to its attention calculation, a query at position i = 10 attends to two keys at positions j₁ = 5 and j₂ = 2. Assuming the scaled dot-product portion of the score is identical for both keys, how will the addition of the positional bias term PE(i, j) affect their final pre-Softmax attention scores?
Interaction of Semantic and Positional Scores
Learn After
A language model computes its pre-normalized attention scores using the formula:
Score = (query_vector ⋅ key_vector + β ⋅ (key_position - query_position)) / sqrt(dimension). In this model, the scalar hyperparameter β is a fixed negative number. Consider a query token at position i = 10. How does the bias term β ⋅ (key_position - query_position) influence the scores for a key token at position j = 12 compared to a key token at position j = 20, assuming all other components of the score are equal for both keys?
Calculating a Pre-Softmax Attention Score with Positional Bias
In a language model using the complete ALiBi attention formula for causal text generation, the model needs to prevent a query token at position i from attending to any key token at a future position j (where j > i). How does the Mask(i, j) term within the formula α(i, j) = Softmax((q_iᵀk_j + β⋅(j−i))/√d + Mask(i, j)) achieve this?
Modeling Arbitrarily Long Sequences with ALiBi
Tuning the ALiBi Bias Scalar (β)
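For concreteness, here is a brief worked sketch of the arithmetic behind the preview questions above; the slope value β = −0.5 is an assumption chosen purely for illustration. With a query at position i = 10, the bias term β⋅(j−i) contributes −0.5⋅(12−10) = −1.0 to the score of a key at j = 12 and −0.5⋅(20−10) = −5.0 to a key at j = 20, so with equal dot products the nearer key keeps the larger pre-Softmax score. Under the complete causal formula, a future key (j > i) additionally receives Mask(i, j) = −∞, which drives its post-Softmax weight to zero regardless of the bias.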