Number of Attention Heads
When configuring multi-head self-attention sub-layers in Transformers, one must specify the number of heads, denoted as $n_\text{head}$. Increasing this hyperparameter expands the number of distinct subspaces over which attention is computed. In practical implementations, it is common to configure the model such that $d_\text{head} = d / n_\text{head}$, where $d$ is the model's hidden size, so each head attends over a lower-dimensional subspace while the total width of the layer stays fixed.
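As a minimal sketch of this convention (the concrete values and names such as `d_model` and `n_head` are illustrative assumptions, not taken from the course material), splitting the hidden size across heads might look like this in PyTorch:

```python
import torch

# Hypothetical configuration, for illustration only.
d_model = 512   # model / hidden size d
n_head = 8      # number of attention heads

# Common convention: each head gets a subspace of size
# d_head = d_model / n_head, keeping the total width unchanged.
assert d_model % n_head == 0, "d_model must be divisible by n_head"
d_head = d_model // n_head  # 512 / 8 = 64

batch, seq_len = 2, 16
x = torch.randn(batch, seq_len, d_model)

# Reshape the model dimension into n_head subspaces of size d_head,
# then move the head axis forward for per-head attention.
heads = x.view(batch, seq_len, n_head, d_head).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 8, 16, 64])
```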
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Self-Attention layer understanding - Step 5 - Adding the time
Query, Key, and Value Projections in Multi-Head Attention
Scalar per Head in Multi-Head Attention
In a multi-head self-attention mechanism, what is the primary advantage of using multiple parallel attention 'heads'—each with its own unique set of learnable weight matrices—compared to using a single attention mechanism with the same total dimensionality?
Analysis of a Modified Attention Mechanism
Arrange the following computational steps of a multi-head self-attention layer in the correct chronological order, starting from the point where the layer receives its input representation matrix.
Diagnosing a Transformer Block Refactor: Attention/FFN Shapes and Norm Placement
Choosing Pre-Norm vs Post-Norm for a Deep Transformer: Stability, Shapes, and Sub-layer Semantics
Root-Cause Analysis of Training Instability After a “Minor” Transformer Block Change
Production Bug Triage: Transformer Block Norm Placement vs Attention/FFN Interface Contracts
Post-Norm vs Pre-Norm Migration: Verifying Tensor Shapes and Correct Sub-layer Wiring
Incident Review: Silent Performance Regression After “Optimization” of a Transformer Block
Design a Transformer Block Spec for a New Internal LLM Library (Shapes + Norm Placement)
You are reviewing a teammate’s implementation of a...
You’re debugging a Transformer block in an interna...
You’re implementing a single Transformer block in ...
Number of Attention Heads
Reducing KV Cache Complexity via Head Sharing
Embedding Size in Transformer Models
Evaluating Language Model Design Choices
A research team is tasked with building a language model to analyze a large collection of specialized legal contracts. These documents contain a unique vocabulary and sentence structure not commonly found in general web text. When deciding how to approach this task, which of the following considerations is the most critical to address first to ensure the model's effectiveness?
Trade-offs in Language Model Vocabulary Design
Hidden Size in Transformer Models
FFN Hidden Size in Transformers
Model Depth in Transformers
Vocabulary Size in Transformers
A machine learning engineer is designing a Transformer encoder for a complex language task. Their primary goal is to improve the model's ability to capture diverse and varied contextual relationships (e.g., syntactic, semantic, co-reference) from different parts of the input sequence simultaneously. Which hyperparameter adjustment would most directly address this specific goal?
Hyperparameter Tuning Trade-offs
An engineer is configuring a Transformer encoder. Match each key hyperparameter to its specific architectural role.
Learn After
A machine learning engineer observes that their language model struggles to understand sentences with multiple, distinct syntactic relationships (e.g., identifying both the subject-verb and modifier-noun relationships in 'The quick brown fox, which was very agile, jumps over the lazy dog.'). The model's self-attention mechanism is currently configured with a single attention head. Which of the following changes is most likely to directly address this specific problem, and why?
Evaluating the Trade-offs of the Number of Attention Heads
Choosing the Number of Attention Heads for a Specific Task