Concept

Parametrization Cost Control in Multi-Head Attention

In a multi-head attention mechanism, running $h$ attention heads in parallel can significantly increase both computational and parametrization costs. To avoid this growth, the query, key, and value projection dimensionalities of each head (denoted $p_q$, $p_k$, and $p_v$) are typically set to $p_q = p_k = p_v = p_o / h$, where $p_o$ is the total desired output dimensionality. Because $p_q h = p_k h = p_v h = p_o$, the $h$ heads can be computed in parallel while the overall resource requirements remain comparable to those of a single-head attention mechanism with dimensionality $p_o$.
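The effect of the constraint can be checked with simple arithmetic. The sketch below is illustrative only: the function `attention_param_count`, its `shrink_heads` flag, the input dimensionality `d_in`, and the sizes 512 and 8 are assumptions made for this example, not part of the original text. It counts weight-matrix entries (ignoring biases) for the per-head query, key, and value projections plus one output projection that maps the concatenated heads to $p_o$ dimensions.

```python
def attention_param_count(d_in, p_o, h, shrink_heads=True):
    """Count weight parameters (no biases) of a multi-head attention layer.

    Hypothetical helper for illustration. Assumes queries, keys, and values
    all have input dimensionality d_in, and that p_q = p_k = p_v per head.
    """
    if shrink_heads:
        if p_o % h != 0:
            raise ValueError("p_o must be divisible by h")
        p_head = p_o // h      # constrained: p_q = p_k = p_v = p_o / h
    else:
        p_head = p_o           # unconstrained: every head keeps full width

    qkv = 3 * h * d_in * p_head    # h heads, each with W_q, W_k, W_v
    out = (h * p_head) * p_o       # output projection on concatenated heads
    return qkv + out


single = attention_param_count(512, 512, h=1)
constrained = attention_param_count(512, 512, h=8)
unconstrained = attention_param_count(512, 512, h=8, shrink_heads=False)

print(single, constrained, unconstrained)
```

Under these assumptions, the constrained 8-head layer has exactly as many weights as the single-head layer, while the unconstrained variant has 8 times as many.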

Updated 2026-05-14

Tags

D2L

Dive into Deep Learning @ D2L