Learn Before
Evaluating a Modification to Multi-Head Attention
Based on the case study, evaluate the modification used in Model B. Explain why introducing a unique, pre-defined (non-learnable) scalar value for each attention head could lead to the observed specialization, and discuss one potential advantage and one potential disadvantage of this approach compared to the standard multi-head attention mechanism in Model A.
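To make the modification concrete, here is a minimal NumPy sketch of the mechanism being evaluated: each head computes standard scaled dot-product attention, and Model B then multiplies each head's output by a fixed, non-learnable scalar before concatenation. The geometric progression of scalars (2^-1, 2^-2, ...) follows the ALiBi-style per-head scheme mentioned in the related topics; the function and variable names are illustrative, not from the case study, and the final output projection is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, n_heads, head_scales=None):
    """Q, K, V: (seq_len, d_model). Splits features into n_heads heads.

    If head_scales is given (Model B), each head's output is multiplied
    by a fixed, non-learnable scalar before concatenation.
    """
    seq_len, d_model = Q.shape
    d_head = d_model // n_heads
    outputs = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        q, k, v = Q[:, s], K[:, s], V[:, s]
        attn = softmax(q @ k.T / np.sqrt(d_head))   # (seq_len, seq_len)
        head_out = attn @ v                          # (seq_len, d_head)
        if head_scales is not None:
            head_out = head_scales[h] * head_out     # Model B's modification
        outputs.append(head_out)
    # Final learned projection omitted for brevity.
    return np.concatenate(outputs, axis=-1)

n_heads = 4
# Geometric progression of per-head scalars (ALiBi-style, an assumption here).
scales = [2.0 ** -(h + 1) for h in range(n_heads)]

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 8))
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 8))

out_a = multi_head_attention(Q, K, V, n_heads)          # Model A: standard
out_b = multi_head_attention(Q, K, V, n_heads, scales)  # Model B: scaled heads
```

Because the scalars are applied after attention, Model B's output differs from Model A's only by a fixed per-head rescaling of each head's slice of the concatenated vector, which is why heads with larger scalars can come to dominate the residual stream and specialize.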
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Evaluating a Modification to Multi-Head Attention
An engineer modifies a standard multi-head attention layer by multiplying the output of each attention head by a unique, pre-defined (non-learnable) scalar value before the final concatenation and projection. What is the most significant functional consequence of this modification?
Rationale for Per-Head Scalars in Attention Mechanisms
Geometric Progression for ALiBi's Scalar per Head