Learn Before
Computing Attention Weights in Sequence Parallelism
When calculating attention weights across multiple computing nodes in a sequence parallel setup, determining the numerator of the Softmax formula, $\exp(\beta_{i,j})$, is straightforward because the required keys and values are stored locally. However, computing the normalization denominator requires a summation over all positions $j'$, denoted as $\sum_{j'=1}^{m} \exp(\beta_{i,j'})$. Because the sequence data is partitioned, calculating this total sum necessitates transferring data to and from other computing nodes. If the keys $\mathbf{K}^{u}$ and values $\mathbf{V}^{u}$ are placed on a specific node $u$, the attention weight $\alpha_{i,j}$ can be calculated by explicitly dividing the denominator into sums over the key subsets located on nodes $1$ through $n$:

$$\alpha_{i,j} \;=\; \frac{\exp(\beta_{i,j})}{\displaystyle\sum_{\mathbf{k}_{j'} \in \mathbf{K}^{1}} \exp(\beta_{i,j'}) \;+\; \cdots \;+\; \sum_{\mathbf{k}_{j'} \in \mathbf{K}^{n}} \exp(\beta_{i,j'})}$$
In this equation, the notation $\mathbf{k}_{j'} \in \mathbf{K}^{u}$ indicates that $\mathbf{k}_{j'}$ is a row vector belonging to the sub-matrix $\mathbf{K}^{u}$. This rewritten formula illustrates that while the numerator is computed entirely on node $u$, the denominator must aggregate partial sums from all nodes.
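The following is a minimal NumPy sketch of this idea, not an implementation from the course material: it simulates the $n$ nodes as a list of key shards, has each "node" compute its local numerators and a scalar partial denominator, and then aggregates only those partial sums. The function name and variable names are illustrative assumptions.

```python
import numpy as np

def attention_weights_sequence_parallel(q, key_shards):
    """Compute alpha_{i,j} for one query q against keys split across shards.

    q          : query vector, shape (d,)
    key_shards : list of n arrays; shard u (the keys K^u on node u) has shape (m_u, d)
    Returns the full attention-weight vector alpha over all key positions.
    """
    d = q.shape[0]
    # Each "node" u computes its local numerators exp(beta_{i,j}) for k_j in K^u.
    local_exps = [np.exp(K_u @ q / np.sqrt(d)) for K_u in key_shards]
    # Only the scalar partial sums of the denominator need to be exchanged.
    partial_sums = [e.sum() for e in local_exps]
    denom = sum(partial_sums)  # aggregation of partial sums from all nodes
    return np.concatenate(local_exps) / denom

# Sanity check against a single-node Softmax over the unpartitioned keys.
rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(12, 8))
shards = np.array_split(K, 4)  # simulate 4 nodes holding K^1, ..., K^4
alpha_parallel = attention_weights_sequence_parallel(q, shards)
scores = K @ q / np.sqrt(8)
alpha_single = np.exp(scores) / np.exp(scores).sum()
assert np.allclose(alpha_parallel, alpha_single)
```

The sanity check reflects the point of the formula: splitting the denominator into per-node sums changes where the work happens and what must be communicated, but not the resulting attention weights.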
Tags
Foundations of Large Language Models
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Resolving Memory Bottlenecks in Attention Mechanisms
A machine learning team is processing an extremely long input sequence and wants to parallelize the self-attention computation across 4 GPUs using sequence parallelism. For a single attention head, which of the following strategies correctly describes how the Key (K) and Value (V) matrices should be partitioned and distributed?
Flawed Parallel Attention Implementation
Computing Attention Weights in Sequence Parallelism
Motivation for Sequence Parallelism
In a self-attention mechanism, the raw attention scores (β) for a single query vector with respect to three key vectors are calculated as [2.0, 1.0, 0.5]. To convert these scores into a probability distribution, a normalization function is applied. What is the resulting normalized attention weight (α) corresponding to the first key vector (score of 2.0)?
In a self-attention mechanism, a set of raw, unnormalized attention scores for a specific query are [1.5, 0.5, -1.0]. If a constant value of 10 is added to each of these scores, resulting in a new set of scores [11.5, 10.5, 9.0], how will the final normalized attention weights (the probability distribution) calculated from the new scores compare to the weights calculated from the original scores?
Calculating and Interpreting Attention Weights
Self-Attention Output Formula for a Single Query
Distributed Attention Weight Formula