Formula

Multi-Head Attention Output Calculation

Given a representation matrix $\mathbf{H} \in \mathbb{R}^{m \times d}$, the multi-head self-attention function computes its output by concatenating the results from multiple individual attention heads. This relationship is formalized as:

$$F(\mathbf{H}) = \mathrm{Merge}(\mathrm{head}_1, \dots, \mathrm{head}_\tau)\, \mathbf{W}^{\mathrm{head}}$$

In this equation, $\mathrm{Merge}(\cdot)$ denotes the concatenation of its inputs. Each $\mathrm{head}_j$ is the output of applying Query-Key-Value (QKV) attention to a specific sub-space of the initial representation. Finally, the concatenated results are projected via multiplication with a parameter matrix $\mathbf{W}^{\mathrm{head}} \in \mathbb{R}^{d \times d}$ to yield the final sequence representation.
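As a concrete illustration, here is a minimal NumPy sketch of this computation. Only $\mathrm{Merge}(\cdot)$ and $\mathbf{W}^{\mathrm{head}}$ come from the formula above; the head count `tau`, the shared query/key/value projections `W_q`, `W_k`, `W_v`, the per-head sub-space width $d/\tau$, and the $1/\sqrt{d_h}$ scaling inside each head are standard QKV-attention choices assumed here for illustration.

```python
import numpy as np

def multi_head_attention(H, W_q, W_k, W_v, W_head, tau):
    """Sketch of F(H) = Merge(head_1, ..., head_tau) W_head.

    H:             (m, d) input representation matrix.
    W_q, W_k, W_v: (d, d) QKV projections (assumed); each head reads a d/tau slice.
    W_head:        (d, d) output projection from the formula above.
    """
    m, d = H.shape
    d_h = d // tau                      # width of each head's sub-space
    Q, K, V = H @ W_q, H @ W_k, H @ W_v

    heads = []
    for j in range(tau):
        s = slice(j * d_h, (j + 1) * d_h)
        # QKV attention on the j-th sub-space (scaled dot-product, assumed)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_h)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)          # row-wise softmax
        heads.append(w @ V[:, s])                   # head_j: (m, d_h)

    # Merge(.) = concatenation along the feature axis, then project with W_head
    return np.concatenate(heads, axis=-1) @ W_head  # (m, d)

rng = np.random.default_rng(0)
m, d, tau = 5, 64, 8
H = rng.normal(size=(m, d))
W_q, W_k, W_v, W_head = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))
print(multi_head_attention(H, W_q, W_k, W_v, W_head, tau).shape)  # (5, 64)
```

Slicing shared $(d, d)$ projections this way is equivalent to giving each head its own $d \times d/\tau$ query, key, and value matrices; either view yields the same $\tau$ sub-space outputs that $\mathrm{Merge}(\cdot)$ concatenates.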

