A model is processing a sequence of three tokens. For the query at position 2, the un-normalized attention scores with respect to the keys at positions 0, 1, and 2 are calculated as [1.0, 2.0, 3.0] respectively. What is the final attention weight that the token at position 2 will assign to the token at position 1?
≈ 0.245. The attention weights are the softmax of the un-normalized scores, so the weight for position 1 is e^2.0 / (e^1.0 + e^2.0 + e^3.0) ≈ 7.389 / 30.193 ≈ 0.245.
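The calculation can be checked with a minimal sketch (a plain softmax over the three scores; the function name is my own, not from a specific library):

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability; this shift does not
    # change the normalized result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Un-normalized attention scores for the query at position 2,
# against the keys at positions 0, 1, and 2.
scores = [1.0, 2.0, 3.0]
weights = softmax(scores)
print(weights[1])  # weight assigned to position 1, ≈ 0.2447
```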
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Scaled Dot-Product Attention
Causal Self-Attention in Autoregressive Decoders
Attention Output as a Weighted Sum of Values
Impact of Masking on Attention Weight Distribution
True or False: In a self-attention mechanism, if you add the same constant value to all un-normalized attention scores corresponding to a single query vector, the final normalized attention weights for that query will change.
Attention Weight Formula
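The related "True or False" question above turns on a property worth verifying directly: softmax is shift-invariant, so adding the same constant to every un-normalized score for a query leaves the normalized weights unchanged (the statement is therefore false). A minimal sketch (helper name is my own):

```python
import math

def softmax(scores):
    # Standard softmax with the usual max-subtraction for stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

base = softmax([1.0, 2.0, 3.0])
shifted = softmax([1.0 + 5.0, 2.0 + 5.0, 3.0 + 5.0])

# Adding a constant (here 5.0) to every score multiplies each
# exponentiated term by e^5.0, which cancels in the normalization.
assert all(abs(a - b) < 1e-12 for a, b in zip(base, shifted))
```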