The output of a self-attention layer for a single query vector, $\mathbf{q}_i$, is computed as a weighted sum of all value vectors, $\mathbf{v}_j$, in the sequence. The attention weights, $\alpha_{i,j}$, which are calculated separately, determine the contribution of each value vector to the final output for the query. This relationship is expressed by the formula: $$\text{Att}_{\text{qkv}}(\mathbf{q}_i, \mathbf{K}, \mathbf{V}) = \sum_{j=0}^{m-1} \alpha_{i,j} \mathbf{v}_j$$ where $m$ is the sequence length.
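As a minimal sketch of the formula above, the weighted sum can be written as a single vector-matrix product. The function name `attention_output` and the toy numbers are assumptions made for illustration; the weights $\alpha_{i,j}$ are taken as given, since they are calculated separately.

```python
import numpy as np

def attention_output(alpha_i, V):
    """Weighted sum of value vectors for a single query.

    alpha_i: shape (m,), attention weights for query i (assumed to sum to 1)
    V:       shape (m, d), one value vector v_j per row
    """
    # Equivalent to sum_j alpha_i[j] * V[j]
    return alpha_i @ V

# Toy example (hypothetical numbers): m = 3 positions, d = 2
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])
alpha_i = np.array([0.5, 0.25, 0.25])
out = attention_output(alpha_i, V)
print(out)
```

Each entry of `alpha_i` scales one row of `V`, so a value vector with a larger weight contributes more to the output.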

Attention Output as a Weighted Sum of Values

In attention mechanisms, the Value matrix, denoted $$\mathbf{V}$$, contains the set of value vectors for an input sequence, one per row. Its dimensions are $$i' \times d$$, where $$i'$$ is the sequence length (the number of value vectors) and $$d$$ is the dimension of each individual value vector. This is formally expressed as: $$\mathbf{V} \in \mathbb{R}^{i' \times d}$$

Value Matrix (V) in Attention

The multi-head self-attention function operates on an input representation matrix, $$\mathbf{H} \in \mathbb{R}^{m \times d}$$. Rather than using a single set of attention parameters, this mechanism employs $$h$$ parallel 'attention heads'. Each head has its own unique set of learnable weight matrices for Query, Key, and Value projections. An attention pooling function $$f$$ (such as additive attention or scaled dot-product attention) is applied independently within each head. The outputs from all heads are then concatenated and projected through a final linear transformation to produce the layer's output. Because each head operates in its own learned subspace, different heads may focus on different parts of the input, enabling the model to jointly attend to information from multiple representational subspaces at different positions. This design allows the mechanism to express more sophisticated functions than a simple weighted average.
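The description above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: it picks scaled dot-product attention as the pooling function $$f$$, gives each head a dimension of $$d / h$$, and uses random matrices in place of learned weights. The names `W_q`, `W_k`, `W_v`, `W_o` and `d_head` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(H, W_q, W_k, W_v, W_o):
    """H: (m, d). W_q, W_k, W_v: (h, d, d_head). W_o: (h * d_head, d)."""
    head_outputs = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        # Each head projects the same input with its own matrices
        Q, K, V = H @ Wq, H @ Wk, H @ Wv             # each (m, d_head)
        scores = Q @ K.T / np.sqrt(Q.shape[-1])      # (m, m)
        alpha = softmax(scores, axis=-1)             # each row sums to 1
        head_outputs.append(alpha @ V)               # (m, d_head)
    concat = np.concatenate(head_outputs, axis=-1)   # (m, h * d_head)
    return concat @ W_o                              # final linear projection, (m, d)

# Random stand-ins for learned parameters (assumption: d_head = d // h)
rng = np.random.default_rng(0)
m, d, h = 5, 8, 2
d_head = d // h
H = rng.normal(size=(m, d))
W_q = rng.normal(size=(h, d, d_head))
W_k = rng.normal(size=(h, d, d_head))
W_v = rng.normal(size=(h, d, d_head))
W_o = rng.normal(size=(h * d_head, d))
out = multi_head_self_attention(H, W_q, W_k, W_v, W_o)
print(out.shape)
```

The explicit loop over heads is kept for readability; practical implementations usually batch the heads into a single tensor operation.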

Multi-Head Self-Attention Function

The previous algorithm can work well, but the paper also introduced another modification. We still calculate the scores from the keys and the query and take their softmax. But instead of summing the input vectors weighted by the scores, we introduce another MLP/matrix called Values. We pass all the vectors through this matrix, multiply each result by the corresponding score, and add them all up.
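A minimal sketch of this step, assuming the query and the keys have already been computed and treating the Values MLP as a single matrix `W_v` (a hypothetical name). The function name `attend_with_values` and the toy inputs are likewise illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_with_values(query, keys, X, W_v):
    """query: (d_k,), keys: (m, d_k), X: (m, d) input vectors, W_v: (d, d_v)."""
    scores = softmax(keys @ query)   # one score per position, shape (m,)
    values = X @ W_v                 # pass every input vector through Values
    return scores @ values           # score-weighted sum of the value vectors

# Toy inputs (hypothetical): m = 3 positions, d = d_k = d_v = 2
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
W_v = np.array([[2.0, 0.0],
                [0.0, 2.0]])
keys = X.copy()
query = np.array([1.0, 0.0])
out = attend_with_values(query, keys, X, W_v)
print(out)
```

The only change from summing the raw inputs is the extra projection `X @ W_v`, which lets the content that gets mixed differ from the content used for scoring.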


University of California, Berkeley

Now consider the previous modification. With it, the words that get the biggest scores are simply the ones most similar to the current word. We want relevant words to get big scores, rather than merely similar ones. So at this step, instead of taking the dot product of the raw embeddings, we first pass those embeddings through an ordinary Dense layer (no activation function) before calculating the scores. This matrix/MLP is called the Key matrix. It would also be good, at each time step, to pass the current vector through a different matrix rather than through Keys (because if we pass it through Keys as well, the scores will in the end still just reflect the similarity between vectors). We call it the Query matrix/MLP. So at each step we pass the current vector through the Query neural network and all other vectors through the Keys neural network. The process then continues as before.
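The Keys and Queries step can be sketched as below. This is a sketch under assumptions: each Dense layer is modelled as a plain matrix without bias (`W_q`, `W_k`, both hypothetical names), keys are computed for every position including the current one, and "as before" is taken to mean a softmax over the scores followed by a weighted sum of the input vectors.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_scores(X, i, W_q, W_k):
    """X: (m, d) embeddings, i: index of the current vector, W_q and W_k: (d, d_k)."""
    q = X[i] @ W_q          # current vector through the Query matrix
    K = X @ W_k             # all vectors through the Key matrix
    return softmax(K @ q)   # scores over the m positions

def self_attention_step(X, i, W_q, W_k):
    # "As before": sum the input vectors weighted by their scores
    return attention_scores(X, i, W_q, W_k) @ X

# Random stand-ins for learned weights (hypothetical sizes)
rng = np.random.default_rng(0)
m, d, d_k = 4, 3, 2
X = rng.normal(size=(m, d))
W_q = rng.normal(size=(d, d_k))
W_k = rng.normal(size=(d, d_k))
scores = attention_scores(X, 0, W_q, W_k)
out = self_attention_step(X, 0, W_q, W_k)
print(scores, out)
```

Because `W_q` and `W_k` are different matrices, a high score no longer requires two embeddings to be similar; it only requires the projected query and key to align.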

Self-Attention layer understanding - Step 2 - Keys, Queries 

A very good article on the Transformer model, with a very in-depth analysis:

http://jalammar.github.io/illustrated-transformer/

The Illustrated Transformer

A very influential paper that introduced the Transformer model:
https://arxiv.org/abs/1706.03762

Attention Is All You Need

A very good video explaining the transformer model:
https://www.youtube.com/watch?v=rBCqOTEfxvg
