Learn Before
Individual Attention Head Formula in Multi-Query Attention (MQA)
In Multi-Query Attention (MQA), the output for an individual head j at step i is calculated using its unique query vector, q_i^[j], while utilizing the Key and Value matrices, K_<=i and V_<=i, which are shared across all heads. This is represented by the formula:

head_j = Att_qkv(q_i^[j], K_<=i, V_<=i)
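The formula can be made concrete with a minimal sketch: each head applies its own query projection, while a single Key projection and a single Value projection are computed once and reused by every head. The function and weight names below (mqa_heads, Wq_list, Wk, Wv) are illustrative, not from the source.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mqa_heads(x, Wq_list, Wk, Wv):
    """Multi-Query Attention sketch.

    x:       (seq_len, d_model) input states up to the current step
    Wq_list: one query projection per head (unique per head)
    Wk, Wv:  single Key/Value projections shared by ALL heads
    Returns a list of per-head outputs, head_j = Att(Q_j, K, V).
    """
    K = x @ Wk            # shared Keys   K_<=i, computed once
    V = x @ Wv            # shared Values V_<=i, computed once
    outputs = []
    for Wq in Wq_list:    # each head j uses its own q_i^[j]
        Q = x @ Wq
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        outputs.append(softmax(scores) @ V)
    return outputs
```

Because K and V are projected only once regardless of head count, the KV cache grows with d_head rather than num_heads * d_head, which is the memory saving MQA is designed for.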
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Individual Attention Head Formula in Multi-Query Attention (MQA)
Attention Mechanism Efficiency Analysis
In an effort to optimize an attention-based model, a researcher modifies the standard multi-head attention mechanism. The new design shares a single Key (K) and Value (V) projection across all attention heads, while each head continues to use its own unique Query (Q) projection. Which statement best analyzes the primary trade-off of this architectural change?
Structural Comparison of Attention Mechanisms
You’re leading an LLM platform team that must supp...
You’re debugging an LLM inference service that mus...
Your team is deploying a chat-based LLM that must ...
Selecting an Attention Design for Long-Context, Low-Latency Inference
Diagnosing and Redesigning Attention for a Long-Context, Cost-Constrained LLM Service
Choosing an Attention Stack for a Regulated, Long-Document Review Assistant
You’re reviewing a design doc for a Transformer at...
Attention Redesign for a Long-Context Customer-Support Copilot Under GPU Memory Pressure
Attention Architecture Choice for On-Device Meeting Summarization with 60k Context
Attention Redesign for a Multi-Tenant LLM with Long Context and Strict KV-Cache Budgets
KV Cache Size in Multi-Query Attention
Learn After
Analysis of Attention Head Architectures
An engineer is analyzing the computational architecture of a large language model. They observe the following formula being used to calculate the output for an individual attention head j at a specific step i:

head_j = Attention(q_i^[j], K_<=i, V_<=i)

Based only on the components of this formula, what is the most accurate conclusion the engineer can draw about the relationship between the different attention heads in this layer?
In a Multi-Query Attention (MQA) layer, all attention heads share the same Key and Value matrices. The formula for the output of a single, specific head j at step i is given as:

head_j = Att_qkv(______, K_<=i, V_<=i)

What component correctly fills the blank to represent the unique input for this specific head?