Learn Before
Modeling Arbitrarily Long Sequences with ALiBi
The Attention with Linear Biases (ALiBi) mechanism functions by adding a fixed scalar penalty to the query-key product (q_iᵀk_j) for each incremental step the key position (j) moves away from the query position (i). Because it relies on this consistent, step-wise penalty rather than a predetermined length limit, the model is not tied to a restricted range of sequence lengths and can be seamlessly employed to process arbitrarily long sequences.
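To make the step-wise penalty concrete, here is a minimal NumPy sketch (an illustration, not a reference implementation) of the biased pre-softmax scores, following the formula used on these cards: Score(i, j) = (q_iᵀk_j + β⋅(j - i)) / √d. The function name alibi_scores and the single scalar β are assumptions for this sketch; in practice ALiBi uses a different fixed slope per attention head.

import numpy as np

def alibi_scores(Q, K, beta):
    # Q: (n, d) query vectors, K: (n, d) key vectors for one head.
    # beta: fixed scalar slope for this head (negative on these cards).
    n, d = Q.shape
    dot = Q @ K.T                     # raw query-key products q_i . k_j
    i_pos = np.arange(n)[:, None]     # query positions i (column)
    j_pos = np.arange(n)[None, :]     # key positions j (row)
    bias = beta * (j_pos - i_pos)     # linear bias beta * (j - i)
    return (dot + bias) / np.sqrt(d)  # pre-softmax attention scores

Because the bias depends only on the offset j - i and involves no learned positional table, the same function applies unchanged to any sequence length n, which is what lets the model extrapolate past the lengths seen in training.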
Tags
Foundations of Large Language Models
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Calculating a Pre-Softmax Attention Score with Positional Bias
A language model computes its pre-normalized attention scores using the formula:
Score = (query_vector ⋅ key_vector + β ⋅ (key_position - query_position)) / sqrt(dimension). In this model, the scalar hyperparameter β is a fixed negative number. Consider a query token at position i = 10. How does the bias term β ⋅ (key_position - query_position) influence the scores for a key token at position j = 12 compared to a key token at position j = 20, assuming all other components of the score are equal for both keys?
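For intuition, take a hypothetical β = -0.5 (the question leaves β unspecified): the bias for the key at j = 12 is -0.5 ⋅ (12 - 10) = -1.0, while for the key at j = 20 it is -0.5 ⋅ (20 - 10) = -5.0. Since β is negative, the more distant key receives a strictly lower pre-softmax score, and hence less attention weight after the Softmax.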
Modeling Arbitrarily Long Sequences with ALiBi
In a language model using the complete ALiBi attention formula for causal text generation, the model needs to prevent a query token at position i from attending to any key token at a future position j (where j > i). How does the Mask(i, j) term within the formula α(i, j) = Softmax((q_iᵀk_j + β⋅(j-i))/√d + Mask(i, j)) achieve this?
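One standard construction consistent with this formula (assumed here; the card does not spell it out) sets Mask(i, j) = 0 for j ≤ i and Mask(i, j) = -∞ for j > i. Adding -∞ drives the corresponding exponential inside the Softmax to zero, so every future key receives exactly zero attention weight regardless of its ALiBi-biased score.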
Tuning the ALiBi Bias Scalar (β)