Learn Before
In a self-attention mechanism, the raw attention scores (β) for a single query vector with respect to three key vectors are calculated as [2.0, 1.0, 0.5]. To convert these scores into a probability distribution, a normalization function is applied. What is the resulting normalized attention weight (α) corresponding to the first key vector (score of 2.0)?
≈ 0.63. Applying the softmax normalization: α₁ = e^2.0 / (e^2.0 + e^1.0 + e^0.5) = 7.389 / (7.389 + 2.718 + 1.649) = 7.389 / 11.756 ≈ 0.6285.
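A minimal numeric check of this result, assuming the normalization function is the standard softmax (variable names here are illustrative, not from the source):

```python
import math

# Raw attention scores (beta) for one query against three keys.
raw_scores = [2.0, 1.0, 0.5]

# Softmax: exponentiate each score, then divide by the sum so the
# resulting weights form a probability distribution.
exps = [math.exp(s) for s in raw_scores]
total = sum(exps)
alpha = [e / total for e in exps]

print(alpha[0])  # ~0.6285 -> normalized weight for the first key (score 2.0)
```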
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
In a self-attention mechanism, a set of raw, unnormalized attention scores for a specific query are [1.5, 0.5, -1.0]. If a constant value of 10 is added to each of these scores, resulting in a new set of scores [11.5, 10.5, 9.0], how will the final normalized attention weights (the probability distribution) calculated from the new scores compare to the weights calculated from the original scores? (See the numeric check after this list.)
Calculating and Interpreting Attention Weights
Self-Attention Output Formula for a Single Query
Computing Attention Weights in Sequence Parallelism
Distributed Attention Weight Formula
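The second related card above hinges on the shift invariance of softmax: since e^(s+c) = e^c · e^s, a constant added to every score cancels in the ratio and the distribution is unchanged. A quick sketch verifying this, again assuming softmax as the normalization:

```python
import math

def softmax(scores):
    # Exponentiate and normalize so the weights sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

original = softmax([1.5, 0.5, -1.0])
shifted = softmax([11.5, 10.5, 9.0])  # every score shifted by +10

print(original)  # ~[0.690, 0.254, 0.057]
print(shifted)   # same distribution, up to floating-point error
```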