Attention Output as a Weighted Sum of Values
The output of a self-attention layer for a single query vector q_i is computed as a weighted sum of all value vectors v_j in the sequence. The attention weights α_ij, which are calculated separately, determine the contribution of each value vector to the final output for the query. This relationship is expressed by the formula:

o_i = Σ_{j=1}^{n} α_ij v_j

where n is the sequence length.
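A minimal NumPy sketch of this formula (an illustration, not part of the original note): the attention weights for one query are assumed to have been computed already, e.g. by a softmax over scaled dot-product scores, and the output is formed as the weighted sum of the value vectors.

```python
import numpy as np

# Assumed toy dimensions for illustration: n = 4 tokens, d_v = 3 value dimensions.
n, d_v = 4, 3
rng = np.random.default_rng(0)

V = rng.normal(size=(n, d_v))              # value vectors v_1 ... v_n, one row per token
alpha_i = np.array([0.1, 0.2, 0.3, 0.4])   # attention weights α_ij for one query i (sum to 1)

# o_i = Σ_j α_ij v_j : weighted sum of all value vectors
o_i = alpha_i @ V                          # shape (d_v,)

# Equivalent explicit summation, to make the formula visible
o_i_loop = sum(alpha_i[j] * V[j] for j in range(n))
assert np.allclose(o_i, o_i_loop)
```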

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Value Matrix (V) in Attention
Multi-Head Self-Attention Function
Scaled Dot-Product Attention
Causal Self-Attention in Autoregressive Decoders
A model is processing a sequence of three tokens. For the query at position 2, the un-normalized attention scores with respect to the keys at positions 0, 1, and 2 are calculated as [1.0, 2.0, 3.0] respectively. What is the final attention weight that the token at position 2 will assign to the token at position 1? (A numeric sketch of this calculation follows the Related list.)
Attention Output as a Weighted Sum of Values
Impact of Masking on Attention Weight Distribution
True or False: In a self-attention mechanism, if you add the same constant value to all un-normalized attention scores corresponding to a single query vector, the final normalized attention weights for that query will change.
Attention Weight Formula (α_ij)
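A numeric sketch for the two softmax questions listed above (an illustration, not one of the original cards): it normalizes the example scores [1.0, 2.0, 3.0] and checks that adding the same constant to every score leaves the normalized weights unchanged.

```python
import numpy as np

def softmax(scores):
    # Subtracting the max is the usual numerical-stability trick; it also shows
    # why adding a constant to every score cannot change the normalized weights.
    z = scores - np.max(scores)
    e = np.exp(z)
    return e / e.sum()

scores = np.array([1.0, 2.0, 3.0])        # un-normalized scores for the query at position 2
weights = softmax(scores)
print(weights)                            # ≈ [0.090, 0.245, 0.665]; position 1 gets ≈ 0.245

shifted = softmax(scores + 5.0)           # same constant added to all scores
print(np.allclose(weights, shifted))      # True: the weights do not change
```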
Learn After
Distributed Computation of Weighted Value Sums
Single-Query Attention Computation with Multiplicative Scaling
Calculating an Attention Output Vector
In a self-attention mechanism, the output for a given input element is a weighted sum of 'value' vectors from all elements in the sequence. Consider the calculation for the word 'sat' in the phrase 'The cat sat on the mat'. If the attention weights from 'sat' to the other words are: 'The': 0.05, 'cat': 0.45, 'sat': 0.05, 'on': 0.0, 'the': 0.0, 'mat': 0.45. Which of the following statements best describes the resulting output vector for 'sat'? (A numeric sketch of this calculation appears at the end of this note.)
In a self-attention mechanism, the output for a specific token is calculated as a weighted sum of 'value' vectors from all tokens in the sequence. If the attention weight connecting a query token to a specific value token is exactly zero, that value token has no contribution to the final output for the query token.
Sequence Parallelism
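A numeric sketch for the two weighted-sum questions above (an illustration; the 2-D value vectors are made up for this example): it forms the output for 'sat' from the stated weights and confirms that the zero-weight tokens contribute nothing.

```python
import numpy as np

# Hypothetical 2-D value vectors for the six tokens of "The cat sat on the mat"
tokens = ["The", "cat", "sat", "on", "the", "mat"]
V = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0],
              [3.0, 3.0],
              [4.0, 4.0],
              [0.0, 5.0]])

# Attention weights from 'sat' to every token, as given in the question
alpha = np.array([0.05, 0.45, 0.05, 0.0, 0.0, 0.45])

output_sat = alpha @ V
print(output_sat)            # a blend dominated by the 'cat' and 'mat' value vectors

# Zero-weight tokens ('on', 'the') contribute nothing: dropping them changes nothing
keep = alpha > 0
assert np.allclose(output_sat, alpha[keep] @ V[keep])
```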