Learn Before
Comparing Positional Bias Functions
Consider two different methods for applying a positional penalty to the attention scores in a transformer model. Both penalties are negative and their magnitude increases as the distance between a query and a key grows.
- Method A (Linear): The penalty's magnitude increases at a constant rate with distance (e.g., a penalty of -1 for distance 1, -2 for distance 2, -10 for distance 10).
- Method B (Sub-linear): The penalty's magnitude increases sharply for short distances but then grows much more slowly for longer distances (e.g., using a logarithmic function).
Analyze the potential difference in a model's attention behavior when using Method A versus Method B, particularly regarding how each method handles short-range versus long-range dependencies.
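To make the two schedules concrete, the sketch below computes both penalties at a few distances. This is an illustrative toy, not an official ALiBi or Kerple implementation; the slope and scale parameters, and the use of log(1 + d) for Method B, are assumptions chosen for demonstration.

```python
import math

def linear_penalty(distance, slope=1.0):
    # Method A: penalty magnitude grows at a constant rate with distance.
    return -slope * distance

def log_penalty(distance, scale=1.0):
    # Method B: penalty magnitude grows sharply for short distances,
    # then sub-linearly for longer ones (log(1 + d) is an assumed form).
    return -scale * math.log(1.0 + distance)

for d in (1, 2, 10, 100):
    print(f"distance={d:4d}  linear={linear_penalty(d):8.2f}  "
          f"log={log_penalty(d):6.2f}")
```

Note how at distance 100 the linear penalty reaches -100 while the logarithmic penalty stays near -4.6: Method B leaves distant tokens far more visible to attention, which is the behavior the question asks you to analyze.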
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Kerple Positional Bias Formula
Kerple Logarithmic Bias Formula
Sandwich Method (Chi et al., 2023)
Formula for Relative Position Scaled by Sinusoidal Wavelength
A transformer model incorporates a positional bias mechanism where a penalty is applied to the attention score between a query and a key. This penalty grows larger as the distance between the query's position and the key's position in the sequence increases. Given the sentence 'The quick brown fox jumps over the lazy dog', which of the following query-key pairs would receive the smallest penalty from this mechanism?
A self-attention mechanism is modified to include a bias term that systematically penalizes attention scores between pairs of tokens. The magnitude of this penalty increases as the distance between the tokens' positions in the sequence grows. For which of the following tasks would this modification be most likely to hinder the model's performance?