Learn Before
In a self-attention mechanism, instead of directly comparing the raw input vectors of a sequence, each input vector is first multiplied by three separate, learned parameter matrices. This process creates three distinct representations of the original vector before they are used to calculate attention scores and output values. What is the primary analytical advantage of this approach over simply comparing the original input vectors to each other?
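To make the question concrete, here is a minimal NumPy sketch of the mechanism it describes: each input vector is projected by three learned matrices into query, key, and value representations before attention scores are computed. The matrix shapes and random initialization are illustrative assumptions, not part of the original card.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # Project each input vector into three role-specific spaces,
    # rather than comparing the raw inputs directly.
    Q = X @ W_q  # queries: "what is this token looking for?"
    K = X @ W_k  # keys: "what does this token offer for matching?"
    V = X @ W_v  # values: "what content does this token contribute?"
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaled dot-product attention scores
    # Softmax over the key axis turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, model dimension 8 (illustrative)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # one output vector per input token
```

Because W_q, W_k, and W_v are learned separately, a token's representation as a query can differ from its representation as a key or value, letting the model learn asymmetric, role-specific comparisons instead of a single fixed similarity on the raw inputs.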
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Introduce weight matrices in the transformer
Generation of Query, Key, and Value Vectors in Self-Attention