Multi-Query Attention (MQA)
Multi-Query Attention (MQA) is an architectural refinement of standard multi-head attention designed for greater efficiency: keys and values are shared across heads, while each head keeps its own queries. In MQA, for a given step $t$, there is a single set of shared keys and values, denoted as $\mathbf{K}$ and $\mathbf{V}$. In contrast, there are $h$ distinct queries, denoted as $\mathbf{q}^{(1)}_t, \dots, \mathbf{q}^{(h)}_t$, each corresponding to a different attention head. This allows different heads to learn distinct focuses while being more computationally and memory efficient than standard multi-head attention, chiefly because only one set of keys and values must be computed and stored in the KV cache.
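To make the sharing concrete, here is a minimal NumPy sketch of MQA (an illustration, not code from the source): every head attends with its own queries, but over one shared key set and one shared value set. The function names (`mqa`, `softmax`) and the weight shapes are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mqa(x, Wq, Wk, Wv, num_heads):
    """Multi-Query Attention over a sequence x of shape (seq_len, d_model).

    Wq projects to num_heads * d_head (one query per head);
    Wk and Wv each project to a single d_head (shared by all heads).
    """
    seq_len, d_model = x.shape
    d_head = Wk.shape[1]                               # shared key/value width

    q = (x @ Wq).reshape(seq_len, num_heads, d_head)   # per-head queries
    k = x @ Wk                                         # one shared key set,   (seq_len, d_head)
    v = x @ Wv                                         # one shared value set, (seq_len, d_head)

    # Each head attends with its own queries but the same K and V.
    heads = []
    for h in range(num_heads):
        scores = q[:, h, :] @ k.T / np.sqrt(d_head)    # (seq_len, seq_len)
        heads.append(softmax(scores) @ v)              # (seq_len, d_head)
    return np.concatenate(heads, axis=-1)              # (seq_len, num_heads * d_head)

# Toy usage: 4 heads, d_model = 32, d_head = 8.
rng = np.random.default_rng(0)
x  = rng.normal(size=(10, 32))
Wq = rng.normal(size=(32, 4 * 8))
Wk = rng.normal(size=(32, 8))
Wv = rng.normal(size=(32, 8))
out = mqa(x, Wq, Wk, Wv, num_heads=4)
print(out.shape)  # (10, 32)
```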
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Multi-Query Attention (MQA)
Grouped-Query Attention (GQA)
Cross-layer Multi-head Attention
Diagnosing Attention Head Redundancy
An engineer observes that during the training of a transformer-based model, several attention heads within the same layer consistently produce nearly identical attention patterns for a wide variety of inputs. Despite the model having many heads, this redundancy seems to limit the model's ability to capture diverse linguistic features. This scenario highlights a key motivation for developing more advanced attention mechanisms. What is the most direct problem with the standard multi-head attention design that this observation reveals?
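One way to make the observation above measurable (a hedged sketch; `head_redundancy` is a hypothetical helper, not from the source): flatten each head's attention map and compare heads pairwise with cosine similarity. Off-diagonal values near 1.0 flag heads that are nearly interchangeable.

```python
import numpy as np

def head_redundancy(attn, eps=1e-9):
    """attn: array of shape (num_heads, seq_len, seq_len) holding one layer's
    attention maps. Returns the (num_heads, num_heads) cosine-similarity
    matrix between flattened heads; off-diagonal entries near 1.0 indicate
    heads producing nearly identical attention patterns."""
    flat = attn.reshape(attn.shape[0], -1)
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + eps)
    return flat @ flat.T
```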
Rationale for Advanced Attention Mechanisms
Multi-Query Attention (MQA)
Learn After
Individual Attention Head Formula in Multi-Query Attention (MQA)
Attention Mechanism Efficiency Analysis
In an effort to optimize an attention-based model, a researcher modifies the standard multi-head attention mechanism. The new design shares a single Key (K) and Value (V) projection across all attention heads, while each head continues to use its own unique Query (Q) projection. Which statement best analyzes the primary trade-off of this architectural change?
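The memory side of that trade-off is easy to put in numbers. A small sketch under illustrative assumptions (a 32-layer model, 32 heads of width 128, a 16k-token context, fp16 cache entries): the KV cache stores two tensors (K and V) per layer, and its size scales with the number of key/value heads, so sharing one K/V set across all 32 query heads shrinks the cache 32x.

```python
def kv_cache_bytes(layers, seq_len, kv_heads, d_head, bytes_per_value=2):
    """Total bytes cached for K and V: 2 tensors x layers x seq_len x kv_heads x d_head."""
    return 2 * layers * seq_len * kv_heads * d_head * bytes_per_value

# Illustrative model: 32 layers, head width 128, 16k context, fp16 (2 bytes).
mha_bytes = kv_cache_bytes(32, 16384, kv_heads=32, d_head=128)  # standard MHA: 32 K/V heads
mqa_bytes = kv_cache_bytes(32, 16384, kv_heads=1,  d_head=128)  # MQA: 1 shared K/V head
print(mha_bytes / 2**30, "GiB vs", mqa_bytes / 2**30, "GiB")    # 8.0 GiB vs 0.25 GiB
```

The other side of the trade-off is representational: collapsing to a single K/V projection reduces the diversity of what heads can attend to, which is what Grouped-Query Attention (GQA) mitigates by sharing K/V within groups of heads rather than globally.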
Structural Comparison of Attention Mechanisms
You’re leading an LLM platform team that must supp...
You’re debugging an LLM inference service that mus...
Your team is deploying a chat-based LLM that must ...
Selecting an Attention Design for Long-Context, Low-Latency Inference
Diagnosing and Redesigning Attention for a Long-Context, Cost-Constrained LLM Service
Choosing an Attention Stack for a Regulated, Long-Document Review Assistant
You’re reviewing a design doc for a Transformer at...
Attention Redesign for a Long-Context Customer-Support Copilot Under GPU Memory Pressure
Attention Architecture Choice for On-Device Meeting Summarization with 60k Context
Attention Redesign for a Multi-Tenant LLM with Long Context and Strict KV-Cache Budgets
KV Cache Size in Multi-Query Attention