Evaluating Strategies for Setting Positional Bias Scalars
Based on the scenario presented, critique Researcher B's approach. Which of the two strategies is more likely to produce a robust model that performs well across a variety of tasks without requiring extensive, task-specific tuning? Justify your reasoning.
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Geometric Progression Formula for ALiBi's β Scalar per Head
An engineer is configuring a multi-head attention layer with 8 heads that uses a linear positional bias. Instead of tuning a separate bias scalar (β) for each head, they set the β values to form a decreasing geometric sequence (e.g., Head 1: β = 0.5, Head 2: β = 0.25, Head 3: β = 0.125, and so on). What is the primary advantage of this configuration strategy?
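The sequence described above matches the ALiBi convention for a power-of-two head count, where the common ratio is 2^(−8/num_heads), so 8 heads yield slopes 1/2, 1/4, …, 1/256. As a minimal sketch (the function name `alibi_slopes` is an illustrative choice, not from the question):

```python
def alibi_slopes(num_heads: int) -> list[float]:
    """Per-head bias scalars (β) forming a decreasing geometric sequence.

    Assumes num_heads is a power of two, as in the standard ALiBi scheme:
    the common ratio is 2 ** (-8 / num_heads), so for 8 heads the slopes
    are 1/2, 1/4, ..., 1/256 -- no per-head tuning required.
    """
    ratio = 2 ** (-8 / num_heads)
    return [ratio ** (i + 1) for i in range(num_heads)]

slopes = alibi_slopes(8)
print(slopes[:3])  # [0.5, 0.25, 0.125]
```

Because every β is derived from a single closed-form rule, adding or removing heads requires no retuning, which is the configuration advantage the question probes.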
Rationale for Geometric Progression in Positional Bias