Learn Before
Generalization Limit of Offset-Specific Biases
A major disadvantage of assigning a unique learnable bias to every possible relative-position offset is that the model becomes rigidly tied to the distances it observed during training. When the architecture processes a sequence whose query-key offsets exceed the maximum distance encountered in training, no learned parameter exists for those larger offsets, so the model cannot generalize to longer inputs.
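A minimal Python sketch of this failure mode, under illustrative assumptions (the table size, names, and zero-initialized biases are hypothetical, not from the card): each relative offset in the training range gets its own slot in a bias table, and any offset outside that range has no entry to look up.

```python
# Assume training sequences never exceed max_train_len tokens, so the
# model learns one bias per relative offset in [-(L-1), +(L-1)].
max_train_len = 8
num_offsets = 2 * max_train_len - 1  # offsets -7 .. +7
bias_table = [0.0] * num_offsets     # one learnable bias per offset (zeros here)

def positional_bias(q_pos, k_pos):
    """Look up the learned bias for the offset between a query and a key."""
    offset = q_pos - k_pos
    idx = offset + (max_train_len - 1)  # shift offset to a 0-based index
    if not 0 <= idx < num_offsets:
        # An offset never seen in training has no learned parameter at all.
        raise IndexError(f"no learned bias for offset {offset}")
    return bias_table[idx]

positional_bias(5, 2)    # offset 3: covered by the training range
# positional_bias(10, 0) # offset 10: IndexError, never seen in training
```

Methods such as the T5 bias sidestep this by bucketing large offsets together rather than reserving one parameter per exact distance, so any offset maps to some learned bucket.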
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Calculating Positional Bias from Offset
In a self-attention mechanism that uses a shared, learnable parameter for each unique relative position offset, which of the following query-key pairs will share the exact same positional bias parameter as the pair with a query at position 8 and a key at position 3?
T5 Bias for Relative Positional Embedding
Parameter Implications of Offset-Based Positional Bias
Learn After
A language model is trained exclusively on text sequences with a maximum length of 1024 tokens. Its design includes a component where a unique, learnable numerical bias is assigned to every possible relative distance between token pairs (e.g., a specific bias for a distance of 1, another for a distance of 2, up to the maximum possible distance in the training data). What is the most likely outcome when this model is later tasked with processing a document of 1500 tokens?
Critique of a Relative Positional Bias Method
Diagnosing Model Generalization Failure