Definition

Number of Attention Heads

When configuring multi-head self-attention sub-layers in Transformers, one must specify the number of heads, denoted $n_{\mathrm{head}}$. Increasing this hyperparameter expands the number of distinct subspaces over which attention is computed. In practical implementations, it is common to configure the model such that $n_{\mathrm{head}} \ge 4$.
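
A minimal sketch of how this hyperparameter is used in practice, assuming PyTorch's nn.MultiheadAttention and an illustrative model dimension of 512 (neither is specified in the original): each of the $n_{\mathrm{head}}$ heads attends over its own $d_{\mathrm{model}} / n_{\mathrm{head}}$-dimensional subspace.

```python
import torch
import torch.nn as nn

# Illustrative values (assumptions, not from the original text):
# n_head must divide d_model evenly, so each head works on a
# d_model / n_head = 64-dimensional subspace.
d_model, n_head = 512, 8
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_head, batch_first=True)

x = torch.randn(2, 10, d_model)   # (batch, sequence length, d_model)
out, weights = attn(x, x, x)      # self-attention: query = key = value = x

print(out.shape)                  # torch.Size([2, 10, 512])
print(weights.shape)              # torch.Size([2, 10, 10]) - averaged over heads by default
```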

