Parametrization Cost Control in Multi-Head Attention
In multi-head attention, running $h$ attention heads in parallel could significantly increase both computational and parametrization costs. To avoid this growth, the query, key, and value projection dimensionalities of each head (denoted $p_q$, $p_k$, and $p_v$) are typically set to $p_q = p_k = p_v = p_o / h$, where $p_o$ is the total desired output dimensionality. Because $p_q h = p_k h = p_v h = p_o$, all heads can be computed in parallel while keeping overall resource requirements comparable to those of single-head attention with dimensionality $p_o$.
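The effect of this constraint on parameter count can be checked with a little arithmetic. The sketch below is illustrative only: the function name `attention_param_count`, the input width `d_model`, and the example sizes (512 dimensions, 8 heads) are assumptions rather than values from the text, and bias terms are ignored for simplicity.

```python
def attention_param_count(d_model, p_o, num_heads, per_head_dim=None):
    """Count projection weights (no biases) for an attention layer.

    d_model      -- assumed width of the incoming queries, keys, and values
    p_o          -- total output dimensionality
    num_heads    -- number of attention heads h
    per_head_dim -- p_q = p_k = p_v for each head; defaults to p_o / h
    """
    if per_head_dim is None:
        if p_o % num_heads != 0:
            raise ValueError("p_o must be divisible by num_heads")
        per_head_dim = p_o // num_heads

    # Each head has its own query, key, and value projection of shape
    # (d_model, per_head_dim).
    qkv_weights = 3 * num_heads * d_model * per_head_dim
    # The concatenated head outputs (num_heads * per_head_dim) are mapped
    # to p_o by one output projection.
    output_weights = num_heads * per_head_dim * p_o
    return qkv_weights + output_weights


# Hypothetical sizes chosen only for illustration.
single = attention_param_count(d_model=512, p_o=512, num_heads=1)
constrained = attention_param_count(d_model=512, p_o=512, num_heads=8)
unconstrained = attention_param_count(
    d_model=512, p_o=512, num_heads=8, per_head_dim=512
)

print(single, constrained, unconstrained)
```

Under these assumptions, the constrained 8-head layer has exactly as many weights as the single-head layer, while giving every head the full dimensionality $p_o$ multiplies the count by the number of heads.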
Tags
D2L
Dive into Deep Learning @ D2L
Related
Shape of Key Weight Matrix per Head
Shape of Key Weight Sub-Matrix per Head
In a multi-head attention mechanism with 'M' heads, an engineer makes an implementation error. Instead of creating a unique set of learnable weight matrices for the query, key, and value projections for each of the 'M' heads, the same single set of query, key, and value weight matrices is shared across all heads. What is the primary consequence of this error on the model's functionality?
Rationale for Unique Projections in Multi-Head Attention
Attention Head Specialization
Individual Attention Head Computation (General Vector Form)