Single-Query Attention Computation with Multiplicative Scaling
The attention output for a single query vector, $\mathbf{q}$, is computed based on the key matrix $K$ and value matrix $V$. This formulation calculates attention scores by taking the dot product of the query with the transposed key matrix and scaling the result by multiplying with $\frac{1}{\sqrt{d}}$, where $d$ is the key dimension. The Softmax function converts these scores into attention weights, which are then used to produce a weighted sum of the value vectors. The formula is:

$$\text{Attention}(\mathbf{q}, K, V) = \text{Softmax}\!\left(\mathbf{q} K^{\top} \cdot \frac{1}{\sqrt{d}}\right) V$$
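A minimal sketch of this computation, assuming NumPy; the function name `single_query_attention` and the dimension names (`d` for the query/key dimension, `n` for the number of keys) are illustrative choices, not part of the card:

```python
import numpy as np

def single_query_attention(q, K, V):
    """Attention output for one query q against keys K and values V.

    q: shape (d,)    -- single query vector
    K: shape (n, d)  -- one key vector per row
    V: shape (n, dv) -- one value vector per row
    """
    d = q.shape[0]
    # Dot product with the transposed key matrix, then multiplicative
    # scaling by 1/sqrt(d).
    scores = (q @ K.T) * (1.0 / np.sqrt(d))  # shape (n,)
    # Softmax turns the scores into attention weights that sum to 1
    # (shifted by the max for numerical stability).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted sum of the value vectors.
    return weights @ V  # shape (dv,)
```

As a quick check, `single_query_attention(np.ones(4), np.eye(4), np.eye(4))` yields equal scores for every key, so the weights are uniform and the output is `[0.25, 0.25, 0.25, 0.25]`, an even mixture of the rows of `V`.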
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Distributed Computation of Weighted Value Sums
Calculating an Attention Output Vector
In a self-attention mechanism, the output for a given input element is a weighted sum of 'value' vectors from all elements in the sequence. Consider the calculation for the word 'sat' in the phrase 'The cat sat on the mat'. If the attention weights from 'sat' to each word in the sequence (including itself) are: 'The': 0.05, 'cat': 0.45, 'sat': 0.05, 'on': 0.0, 'the': 0.0, 'mat': 0.45. Which of the following statements best describes the resulting output vector for 'sat'?
In a self-attention mechanism, the output for a specific token is calculated as a weighted sum of 'value' vectors from all tokens in the sequence. If the attention weight connecting a query token to a specific value token is exactly zero, that value token has no contribution to the final output for the query token.
Sequence Parallelism
Scaled Dot-Product Attention
General Attention Formula
Value Matrix for Causal Attention (V_≤i)
Value Matrix from a Sliding Window
An attention mechanism processes an input sequence of 20 tokens, where each token is represented by a 256-dimensional vector. A Value matrix (V) is generated as part of this process. Which of the following statements most accurately describes the properties and role of this V matrix?
Determining Value Matrix Dimensions
Debugging an Attention Mechanism
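The 'sat' question above can be checked with a short worked example. This is a sketch using made-up 4-dimensional value vectors (real value vectors are learned, not chosen); stacking the rows would give the V matrix discussed in the related questions (here 6×4, or 20×256 in the 20-token example):

```python
import numpy as np

# Hypothetical value vectors, one per word; placeholders for illustration.
values = {
    "The": np.array([1.0, 0.0, 0.0, 0.0]),
    "cat": np.array([0.0, 1.0, 0.0, 0.0]),
    "sat": np.array([0.0, 0.0, 1.0, 0.0]),
    "on":  np.array([0.0, 0.0, 0.0, 1.0]),
    "the": np.array([1.0, 1.0, 0.0, 0.0]),
    "mat": np.array([0.0, 0.0, 1.0, 1.0]),
}
weights = {"The": 0.05, "cat": 0.45, "sat": 0.05,
           "on": 0.0, "the": 0.0, "mat": 0.45}

# The output for 'sat' is the weighted sum of all value vectors; the
# zero-weight tokens ('on', 'the') contribute nothing to it.
output = sum(w * values[tok] for tok, w in weights.items())
print(output)  # [0.05 0.45 0.5  0.45] -- dominated by 'cat' and 'mat'
```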
Learn After
Causal Attention
In an attention mechanism, the scores for a query vector q are calculated by taking its dot product with a set of key vectors K. These scores are then scaled by a factor related to the vector dimension before being passed to a Softmax function to produce weights. A developer implements this but omits the scaling step, using the formula Softmax(q * K^T) * V. What is the most likely adverse effect of this omission, especially when the dimension of the key vectors is large?
Calculating Pre-Softmax Attention Scores
Applying Scaled Dot-Product Attention
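The 'omitted scaling' question in the list above can be made concrete with a small numerical sketch (NumPy assumed; the dimension 512 and the 8 keys are arbitrary illustrative choices). With large d, the unscaled dot products have magnitudes on the order of sqrt(d), so the Softmax saturates into a near one-hot distribution and its gradients vanish:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                          # large key/query dimension
q = rng.standard_normal(d)
K = rng.standard_normal((8, d))  # 8 key vectors, one per row

def softmax(x):
    e = np.exp(x - x.max())      # shift by max for stability
    return e / e.sum()

unscaled = softmax(q @ K.T)               # omits the scaling step
scaled = softmax((q @ K.T) / np.sqrt(d))  # standard scaled version

# The unscaled weights collapse onto one or two keys (near one-hot),
# while the scaled weights remain comparatively smooth.
print(unscaled.round(3))
print(scaled.round(3))
```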