1Cademy - Generalized Formula for Pre-Norm Architecture

Learn Before

Pre-Norm Architecture in Transformers

Generalized Formula for Pre-Norm Architecture

The operation within a sub-layer of a Transformer block using the pre-norm architecture is generalized by the formula:

$$\text{output} = \text{LNorm}(F(\text{input})) + \text{input}$$

In this equation, $F$ is the sub-layer's function (e.g., self-attention or FFN), and $\text{LNorm}$ is layer normalization. The input and output are both matrices of size $m \times d$, where $m$ is the sequence length and $d$ is the representation dimension. Each row of these matrices is the contextual representation of one token in the sequence. In this form, $\text{LNorm}$ is applied to the output of $F$, and the normalized result is then added to the unmodified input through the residual connection.
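The formula can be illustrated with a minimal pure-Python sketch that follows the equation exactly as written here. Several things in it are assumptions made for illustration: the helper names `layer_norm`, `sublayer`, and `toy_f` are hypothetical, `toy_f` is only a stand-in for a real self-attention or FFN function, and the learned gain and bias of layer normalization are omitted. Note that other presentations of pre-norm place the normalization on the sub-layer's input, i.e. $F(\text{LNorm}(\text{input})) + \text{input}$; the sketch keeps the form given on this page.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize each row (one token) of an m x d matrix to zero mean and
    roughly unit variance. Learned gain/bias are omitted (assumption)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def sublayer(x, f):
    """Compute output = LNorm(F(input)) + input, as in the formula above."""
    normed = layer_norm(f(x))
    # Residual connection: add the unmodified input element-wise.
    return [[n + v for n, v in zip(n_row, x_row)]
            for n_row, x_row in zip(normed, x)]


def toy_f(x):
    """Hypothetical stand-in for F (not attention or an FFN): doubles entries."""
    return [[2.0 * v for v in row] for row in x]


# m = 3 tokens, d = 4 representation dimensions.
x = [[1.0, 2.0, 3.0, 4.0],
     [0.5, -1.0, 2.0, 0.0],
     [3.0, 3.0, 1.0, -2.0]]

y = sublayer(x, toy_f)
print(len(y), len(y[0]))  # output has the same m x d shape as the input
```

Because the residual term is added after normalization, subtracting the input from the output recovers the normalized rows of $F(\text{input})$, each with mean close to zero.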

Updated 2026-04-21
