1Cademy - Parametrization Cost Control in Multi-Head Attention

Learn Before

Query, Key, and Value Projections in Multi-Head Attention

Concept

Parametrization Cost Control in Multi-Head Attention

In multi-head attention, running $h$ heads in parallel would scale both the computational cost and the parameter count roughly in proportion to $h$ if every head kept the full dimensionality. To avoid this growth, the per-head dimensionalities of the query, key, and value projections ($p_q$, $p_k$, and $p_v$) are typically set to $p_o / h$, where $p_o$ is the desired total output dimensionality. Because $p_q h = p_k h = p_v h = p_o$, all $h$ heads can be computed in parallel while the overall resource requirements stay comparable to those of a single-head attention mechanism with dimensionality $p_o$.
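The effect of this constraint on the parameter count can be checked with a little arithmetic. The sketch below is illustrative only: the function names, the input dimensionality `p_in`, and the choice to ignore biases are assumptions made here, not part of the source. It counts the weights of the query, key, and value projections of all heads plus one output projection over the concatenated heads, first with $p_q = p_k = p_v = p_o / h$ and then for a hypothetical unconstrained variant in which every head keeps width $p_o$.

```python
def constrained_params(p_in: int, p_o: int, h: int) -> int:
    """Weight count when each head uses p_q = p_k = p_v = p_o / h (biases ignored)."""
    if p_o % h != 0:
        raise ValueError("p_o must be divisible by h")
    p_head = p_o // h                  # per-head projection width
    per_head = 3 * p_in * p_head       # W_q, W_k, W_v of a single head
    output = (h * p_head) * p_o        # W_o over the concatenated heads (h * p_v = p_o)
    return h * per_head + output


def unconstrained_params(p_in: int, p_o: int, h: int) -> int:
    """Hypothetical variant: every head keeps the full width p_o (biases ignored)."""
    per_head = 3 * p_in * p_o          # W_q, W_k, W_v of a single head
    output = (h * p_o) * p_o           # W_o over a concatenation of width h * p_o
    return h * per_head + output


# Example values chosen for illustration: p_in = p_o = 512, h = 8.
single_head = constrained_params(512, 512, 1)
multi_head = constrained_params(512, 512, 8)
naive_multi_head = unconstrained_params(512, 512, 8)
```

Under these assumptions, `multi_head` equals `single_head`, so splitting $p_o$ across 8 heads adds no weights, while `naive_multi_head` is 8 times larger.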

Updated 2026-05-14

References

Dive into Deep Learning
