Learn Before
Rationale for Per-Head Scalars in Attention Mechanisms
In a multi-head attention layer, instead of applying a single, uniform modification to the combined output, some architectures associate a unique scalar value with each individual attention head. Analyze the primary advantage of this per-head approach. Why is it more powerful or flexible than applying a single scalar to the entire layer's output after the heads are concatenated?
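A minimal sketch of what "a unique scalar value per head" can look like in code, contrasted in the comments with a single scalar applied after concatenation. This assumes a standard scaled-dot-product formulation; the parameter name head_scales and the learnable-gate choice are illustrative, not taken from any particular paper.

```python
import torch
import torch.nn as nn


class PerHeadScaledAttention(nn.Module):
    """Multi-head attention where each head's output is multiplied by its
    own learnable scalar before concatenation (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One scalar per head (hypothetical parameter name).
        self.head_scales = nn.Parameter(torch.ones(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq, d_model = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):
            # (batch, seq, d_model) -> (batch, heads, seq, d_head)
            return t.view(batch, seq, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v  # (batch, heads, seq, d_head)

        # Per-head scaling: each head can be amplified or suppressed
        # independently. A single scalar applied to the concatenated output
        # would rescale every head by the same factor, and that factor could
        # simply be absorbed into the output projection self.out.
        heads = heads * self.head_scales.view(1, self.n_heads, 1, 1)

        merged = heads.transpose(1, 2).reshape(batch, seq, d_model)
        return self.out(merged)
```

The comments capture the crux of the question: a single post-concatenation scalar changes nothing about the relative contribution of the heads, whereas per-head scalars give the layer an extra degree of freedom per head to re-weight them individually.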
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Evaluating a Modification to Multi-Head Attention
An engineer modifies a standard multi-head attention layer by multiplying the output of each attention head by a unique, pre-defined (non-learnable) scalar value before the final concatenation and projection. What is the most significant functional consequence of this modification?
Rationale for Per-Head Scalars in Attention Mechanisms
Geometric Progression for ALiBi's Scalar per Head
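For the related items above: a small sketch of the kind of fixed, non-learnable per-head scalars they refer to, using the geometric progression described for ALiBi when the number of heads is a power of two (the sequence starts at 2**(-8 / n_heads) and uses that same value as its ratio; the rule for other head counts differs and is not shown here).

```python
def alibi_slopes(n_heads: int) -> list[float]:
    """Per-head ALiBi slopes for a power-of-two head count:
    a geometric progression starting at 2**(-8 / n_heads) with that
    same value as its ratio."""
    start = 2 ** (-8 / n_heads)
    return [start ** (i + 1) for i in range(n_heads)]


print(alibi_slopes(8))  # [0.5, 0.25, 0.125, ..., 0.00390625]
```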