Learn Before
Improved Multi-Head Attention Mechanism
Grouped-Query Attention (GQA)
Grouped-Query Attention (GQA) is an attention mechanism that balances the computational efficiency of Multi-Query Attention (MQA) against the expressiveness of standard Multi-Head Attention (MHA). It works by partitioning the query heads into groups, with each group sharing a single Key (K) and Value (V) projection. The number of groups, denoted n_g, is an adjustable parameter that trades off computational efficiency against model quality: setting n_g = 1 recovers MQA, while setting n_g equal to the number of query heads recovers MHA.
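As a concrete illustration, here is a minimal sketch of GQA in PyTorch. The class and parameter names (d_model, n_heads, n_groups) are illustrative assumptions, and production implementations would also handle attention masking, KV caching, and positional embeddings:

```python
import torch
import torch.nn.functional as F
from torch import nn

class GroupedQueryAttention(nn.Module):
    """Sketch of GQA: n_heads query heads share n_groups K/V heads."""

    def __init__(self, d_model: int, n_heads: int, n_groups: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must divide evenly across heads"
        assert n_heads % n_groups == 0, "query heads must divide evenly into groups"
        self.n_heads = n_heads
        self.n_groups = n_groups
        self.head_dim = d_model // n_heads
        # One Q projection per query head, but only n_groups K/V projections.
        self.w_q = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.w_k = nn.Linear(d_model, n_groups * self.head_dim, bias=False)
        self.w_v = nn.Linear(d_model, n_groups * self.head_dim, bias=False)
        self.w_o = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.w_q(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.w_k(x).view(b, t, self.n_groups, self.head_dim).transpose(1, 2)
        v = self.w_v(x).view(b, t, self.n_groups, self.head_dim).transpose(1, 2)
        # Each shared K/V head serves n_heads // n_groups query heads.
        repeat = self.n_heads // self.n_groups
        k = k.repeat_interleave(repeat, dim=1)
        v = v.repeat_interleave(repeat, dim=1)
        # Standard scaled dot-product attention over the expanded K/V.
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.w_o(out)

# Example: 8 query heads sharing 2 K/V heads (n_g = 2).
gqa = GroupedQueryAttention(d_model=512, n_heads=8, n_groups=2)
y = gqa(torch.randn(4, 16, 512))  # -> shape (4, 16, 512)
```

The efficiency gain comes from the K/V projections and cache shrinking by a factor of n_heads / n_groups, while the query heads retain their full diversity.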
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Multi-Query Attention (MQA)
Grouped-Query Attention (GQA)
Cross-layer Multi-head Attention
Learn After
Relationship between GQA, MHA, and MQA