Concept

Multi-Query Attention (MQA)

Multi-Query Attention (MQA) is an architectural refinement of standard multi-head attention designed for greater efficiency: keys and values are shared across heads, while each head keeps its own queries. In MQA, for a given step $i$, there is a single set of shared keys and values, denoted $(\mathbf{K}_{\le i}, \mathbf{V}_{\le i})$. In contrast, there are $\tau$ distinct queries, denoted $\left\{\mathbf{q}_{i}^{[1]},\dots,\mathbf{q}_{i}^{[\tau]}\right\}$, each corresponding to a different attention head. This allows different heads to learn distinct attention patterns while being more computationally and memory efficient than standard multi-head attention, since only one set of keys and values must be stored and read per step.
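The idea above can be sketched in NumPy: each of the $\tau$ heads gets its own query tensor, but all heads attend over one shared key/value pair. This is a minimal illustrative sketch, not a production implementation; the function name and shapes are chosen here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(Q, K, V):
    """Multi-Query Attention sketch.

    Q: (tau, seq, d)  -- one query set per head
    K: (seq, d)       -- a single key set, shared by all heads
    V: (seq, d)       -- a single value set, shared by all heads
    Returns: (tau, seq, d) per-head outputs.
    """
    tau, seq, d = Q.shape
    # Scaled dot-product scores against the shared keys: (tau, seq, seq)
    scores = Q @ K.T / np.sqrt(d)
    # Causal mask: step i may only attend to keys at positions <= i
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    weights = softmax(scores, axis=-1)
    # Every head mixes the same shared values with its own weights
    return weights @ V
```

Because `K` and `V` carry no head dimension, the KV cache during decoding shrinks by a factor of $\tau$ relative to standard multi-head attention, which is the main source of MQA's efficiency.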


Updated 2026-04-23


Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences