Attention Weight Formula (α)
The attention weight, denoted as α, is obtained by applying the Softmax function to the pre-softmax attention score β. This normalization step converts the raw scores into a probability distribution, ensuring that the weights for a given query position i sum to one across all key positions j. The formula is expressed as:

α_{i,j} = Softmax(β_{i,j}) = exp(β_{i,j}) / Σ_{j'} exp(β_{i,j'})
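A minimal sketch of this normalization in Python (NumPy assumed; the helper name attention_weights is mine, not from the source). It applies the softmax to the raw scores β for a single query and returns the weights α:

```python
import numpy as np

def attention_weights(scores):
    """Softmax over un-normalized attention scores (beta) -> weights (alpha).

    `scores` is a 1-D array of raw scores for one query against all key
    positions. Subtracting the max before exponentiating is the usual
    numerically stable form; it does not change the result because the
    softmax is invariant to adding a constant to every score.
    """
    scores = np.asarray(scores, dtype=float)
    exps = np.exp(scores - scores.max())
    return exps / exps.sum()

# Worked example matching the related card below: raw scores [1.0, 2.0, 3.0]
# for the query at position 2; the weight assigned to position 1 is the
# middle entry (approx. 0.2447).
alpha = attention_weights([1.0, 2.0, 3.0])
print(alpha)        # approx [0.0900, 0.2447, 0.6652]
print(alpha.sum())  # 1.0
```

The shift-invariance used for numerical stability here is also the point of the constant-offset questions listed under Related and Learn After: adding the same constant to every raw score leaves the normalized weights unchanged.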
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Scaled Dot-Product Attention
Causal Self-Attention in Autoregressive Decoders
A model is processing a sequence of three tokens. For the query at position 2, the un-normalized attention scores with respect to the keys at positions 0, 1, and 2 are calculated as [1.0, 2.0, 3.0] respectively. What is the final attention weight that the token at position 2 will assign to the token at position 1?
Attention Output as a Weighted Sum of Values
Impact of Masking on Attention Weight Distribution
True or False: In a self-attention mechanism, if you add the same constant value to all un-normalized attention scores corresponding to a single query vector, the final normalized attention weights for that query will change.
Attention Weight Formula (α)
Learn After
In a self-attention mechanism, the raw attention scores (β) for a single query vector with respect to three key vectors are calculated as [2.0, 1.0, 0.5]. To convert these scores into a probability distribution, a normalization function is applied. What is the resulting normalized attention weight (α) corresponding to the first key vector (score of 2.0)?
In a self-attention mechanism, a set of raw, unnormalized attention scores for a specific query are [1.5, 0.5, -1.0]. If a constant value of 10 is added to each of these scores, resulting in a new set of scores [11.5, 10.5, 9.0], how will the final normalized attention weights (the probability distribution) calculated from the new scores compare to the weights calculated from the original scores?
Calculating and Interpreting Attention Weights
Self-Attention Output Formula for a Single Query
Computing Attention Weights in Sequence Parallelism
Distributed Attention Weight Formula