Learn Before
Tuning the ALiBi Bias Scalar (β)
In the ALiBi framework, the scalar hyperparameter β sets the magnitude of the positional penalty applied to query-key products. Its value is typically chosen by evaluating candidate settings on a validation dataset and keeping the one that performs best for the task at hand.
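Such a validation sweep can be sketched as a simple grid search. Here `evaluate_on_validation` is a hypothetical stand-in for a real model evaluation; its body and the candidate values are illustrative assumptions, not part of any library:

```python
# Hypothetical validation-based sweep over the ALiBi bias scalar beta.
# `evaluate_on_validation` stands in for running the model on held-out
# data; it is assumed to return a quality score (higher is better).
def evaluate_on_validation(beta):
    # Toy stand-in objective that peaks near beta = 0.1.
    return -abs(beta - 0.1)

candidates = [0.01, 0.05, 0.1, 0.5, 1.0]
best = max(candidates, key=evaluate_on_validation)
print(best)  # 0.1
```

In practice the sweep is the same shape; only the evaluation function changes.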
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
A language model computes its pre-normalized attention scores using the formula:
Score = (query_vector ⋅ key_vector + β ⋅ (key_position - query_position)) / sqrt(dimension). In this model, the scalar hyperparameter β is a fixed negative number. Consider a query token at position i = 10. How does the bias term β ⋅ (key_position - query_position) influence the scores for a key token at position j = 12 compared to a key token at position j = 20, assuming all other components of the score are equal for both keys?
Calculating a Pre-Softmax Attention Score with Positional Bias
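The comparison posed in the question above can be checked numerically. This is a minimal sketch; the values chosen for β, the shared dot product, and the dimension are illustrative assumptions:

```python
import math

def biased_score(qk_dot, beta, i, j, dim):
    """Pre-softmax attention score with a linear positional bias."""
    return (qk_dot + beta * (j - i)) / math.sqrt(dim)

beta = -1.0   # fixed negative scalar (assumed value)
qk_dot = 4.0  # identical query-key dot product for both keys (assumed)
dim = 64
i = 10

near = biased_score(qk_dot, beta, i, 12, dim)  # bias = -1 * 2  = -2
far = biased_score(qk_dot, beta, i, 20, dim)   # bias = -1 * 10 = -10
# The nearer key (j=12) receives a smaller penalty, hence a higher score.
print(near > far)  # True
```

With a negative β, the penalty grows linearly with distance, so attention is biased toward recent positions.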
In a language model using the complete ALiBi attention formula for causal text generation, the model needs to prevent a query token at position i from attending to any key token at a future position j (where j > i). How does the Mask(i, j) term within the formula α(i, j) = Softmax((q_iᵀk_j + β⋅(j-i))/√d + Mask(i, j)) achieve this?
Modeling Arbitrarily Long Sequences with ALiBi
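The effect of the mask term can be demonstrated on a toy score vector: setting Mask(i, j) = -∞ for j > i drives the post-softmax weight of every future position to exactly zero. The score values below are illustrative assumptions:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

i = 2  # query position; keys at positions 0..4
scores = [0.9, 0.4, 1.1, 0.7, 0.2]  # biased scores before masking (assumed)
# Apply the causal mask: -inf for any key strictly in the future.
masked = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
weights = softmax(masked)
# Future positions (j = 3, 4) receive exactly zero attention weight.
print(weights[3] == 0.0 and weights[4] == 0.0)  # True
```

Because exp(-∞) = 0, the masked entries contribute nothing to the softmax numerator or denominator, which is precisely how causality is enforced.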
Tuning the ALiBi Bias Scalar (β)
Learn After
A machine learning team has just finished pre-training a language model using a two-part system. The first, smaller model corrupted text by replacing some words with plausible alternatives. The second, larger model was then trained to identify which words in the text were original and which were replacements. The team's ultimate goal is to use this work to build a system for classifying the sentiment of customer reviews. What is the most effective and standard next step for the team to take?
Impact of ALiBi Bias Scalar on Model Performance
A research team is fine-tuning a language model for a text summarization task. The model uses a positional encoding scheme where a scalar hyperparameter, β, adjusts the strength of a distance-based bias in the attention mechanism. The team experiments with different values for β and records the model's performance on a validation set using the ROUGE score (higher is better). The results are as follows:
β Value    ROUGE Score
0.01       0.35
0.1        0.42
1.0        0.38
10.0       0.29

Based on this data, what is the most reasonable conclusion?
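One way to read off the preferred setting from a table like this is simply to select the β with the highest validation score; the values below are the ones from the question:

```python
# Validation results from the experiment: beta value -> ROUGE score.
results = {0.01: 0.35, 0.1: 0.42, 1.0: 0.38, 10.0: 0.29}
best_beta = max(results, key=results.get)
print(best_beta)  # 0.1
```

The intermediate value wins here: too small a β barely distinguishes positions, while too large a β suppresses attention to all but the nearest tokens.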
Geometric Progression for ALiBi's Scalar per Head
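In the original ALiBi scheme, each attention head gets its own slope, and the slopes form a geometric sequence starting at 2^(-8/n) with that same ratio, for n heads. A minimal sketch of that progression:

```python
def alibi_slopes(n_heads):
    """Per-head ALiBi slopes: a geometric sequence with ratio 2^(-8/n)."""
    ratio = 2 ** (-8 / n_heads)
    return [ratio ** (k + 1) for k in range(n_heads)]

# For 8 heads: 1/2, 1/4, 1/8, ..., 1/256.
print(alibi_slopes(8))
```

This avoids tuning a single β on a validation set: heads with steep slopes focus on nearby tokens, while heads with shallow slopes retain longer-range context.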