An engineer is configuring a multi-head attention layer with 8 heads that uses a linear positional bias. Instead of tuning a separate bias scalar (β) for each head, they set the values to form a decreasing geometric sequence (e.g., Head 1 β=0.5, Head 2 β=0.25, Head 3 β=0.125, and so on). What is the primary advantage of this configuration strategy?
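The geometric sequence described in the question matches the slope scheme used by ALiBi, where the per-head scalar is derived from a fixed formula rather than tuned. A minimal sketch, assuming the standard ALiBi rule β_h = 2^(−8h/n) for n heads (the function name `alibi_slopes` is illustrative):

```python
# Sketch: per-head linear-bias scalars as a decreasing geometric sequence.
# Assumes the ALiBi-style rule beta_h = 2^(-8h/n) for n heads; with n = 8
# this yields 0.5, 0.25, 0.125, ... down to 1/256 -- no per-head tuning.

def alibi_slopes(num_heads: int) -> list[float]:
    """Return one bias scalar per attention head, forming a geometric sequence."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

slopes = alibi_slopes(8)
print(slopes[:3])  # -> [0.5, 0.25, 0.125]
```

Because the whole sequence is fixed by a single formula, adding or removing heads requires no extra hyperparameter search, and each head attends over a different effective context range.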
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Geometric Progression Formula for ALiBi's β Scalar per Head
Evaluating Strategies for Setting Positional Bias Scalars
Rationale for Geometric Progression in Positional Bias