Learn Before
Distributed Attention Weight Formula
When computing attention over long sequences distributed across multiple nodes, the standard formula for an attention weight, $\alpha_{i,j}$, is adapted to reflect parallel processing. The numerator, which is the exponentiated pre-softmax score $\exp(\beta_{i,j})$, is calculated locally on the specific node containing the relevant data. The normalization factor in the denominator is expanded into a sum of independent summations computed across all nodes from $1$ to $N$:

$$\alpha_{i,j} \;=\; \frac{\exp(\beta_{i,j})}{\sum_{n=1}^{N} \sum_{k} \exp\!\big(\mathbf{q}_i \, [\mathbf{K}_n]_k^{\top}\big)}$$

Here, $\mathbf{q}_i$ is the query row vector, and the notation $[\mathbf{K}_n]_k$ signifies that it is a row vector (the $k$-th row) of the specific key sub-matrix, $\mathbf{K}_n$, located on node $n$.
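A minimal NumPy sketch of this idea (illustrative only, not taken from the course material; the function names and the two-node split are assumptions): each node exponentiates its local scores $\mathbf{q}_i [\mathbf{K}_n]_k^{\top}$ and reports a partial sum, the partial sums are added to form the global denominator, and every local numerator is divided by that shared value.

```python
import numpy as np

def local_scores(q, K_n):
    """Pre-softmax scores of query q against the key sub-matrix K_n held on one node."""
    return q @ K_n.T  # beta values for the keys stored locally

def distributed_attention_weights(q, key_shards):
    """Attention weights when the keys are split across N nodes.

    Each node exponentiates its local scores and contributes a partial sum;
    the global softmax denominator is the sum of the N independent partial sums.
    """
    exp_scores = [np.exp(local_scores(q, K_n)) for K_n in key_shards]  # numerators, per node
    denominator = sum(s.sum() for s in exp_scores)                     # sum of N independent sums
    return [s / denominator for s in exp_scores]                       # alpha values, per node

# Toy example: 8 keys of dimension 4, split across N = 2 nodes.
rng = np.random.default_rng(0)
q = rng.normal(size=4)
K = rng.normal(size=(8, 4))
key_shards = np.split(K, 2)                            # K_1 on node 1, K_2 on node 2
alphas = distributed_attention_weights(q, key_shards)
assert np.isclose(np.concatenate(alphas).sum(), 1.0)   # weights still normalize to 1
```

In practice one would also subtract a shared maximum score before exponentiating (a numerically stable softmax), but that refinement is omitted here to keep the per-node structure of the denominator visible.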
Tags
Foundations of Large Language Models
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
In a self-attention mechanism, the raw attention scores (β) for a single query vector with respect to three key vectors are calculated as [2.0, 1.0, 0.5]. To convert these scores into a probability distribution, a normalization function is applied. What is the resulting normalized attention weight (α) corresponding to the first key vector (score of 2.0)?
In a self-attention mechanism, a set of raw, unnormalized attention scores for a specific query are [1.5, 0.5, -1.0]. If a constant value of 10 is added to each of these scores, resulting in a new set of scores [11.5, 10.5, 9.0], how will the final normalized attention weights (the probability distribution) calculated from the new scores compare to the weights calculated from the original scores?
Calculating and Interpreting Attention Weights
Self-Attention Output Formula for a Single Query
Computing Attention Weights in Sequence Parallelism
Distributed Attention Weight Formula
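The two related questions above can be checked with a short numerical sketch (illustrative only, not part of any card): the softmax of [2.0, 1.0, 0.5] assigns roughly 0.63 to the first score, and adding a constant to every score leaves the resulting weights unchanged because the shift cancels between numerator and denominator.

```python
import numpy as np

def softmax(scores):
    """Softmax in its numerically stable form: subtracting the max does not change the result."""
    z = np.asarray(scores, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

print(softmax([2.0, 1.0, 0.5]))         # first weight is about 0.63
original = softmax([1.5, 0.5, -1.0])
shifted  = softmax([11.5, 10.5, 9.0])   # same scores with 10 added to each
print(np.allclose(original, shifted))   # True: softmax is invariant to a constant shift
```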