1Cademy - Residual Connections and Layer Normalization in Transformers

Residual Connections and Layer Normalization in Transformers

In the Transformer architecture, a residual connection is wrapped around each individual sublayer. For the residual addition to be well defined, the sublayer must map any input $\mathbf{x} \in \mathbb{R}^d$ to an output of the same dimension, $\textrm{sublayer}(\mathbf{x}) \in \mathbb{R}^d$, so that the sum $\mathbf{x} + \textrm{sublayer}(\mathbf{x}) \in \mathbb{R}^d$ can be computed. This residual addition is then immediately followed by a layer normalization operation.
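
The computation described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward sublayer, and `layer_norm` omits the learned gain and bias parameters that layer normalization usually carries.

```python
import math


def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance.

    Assumption: the learned gain and bias are left out for brevity.
    """
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def toy_sublayer(x):
    """Hypothetical sublayer: any map from R^d to R^d would do here."""
    return [0.5 * a + 1.0 for a in x]


def residual_block(x, sublayer):
    """Compute layer_norm(x + sublayer(x)) for a single d-dimensional input."""
    out = sublayer(x)
    # The residual addition only makes sense if the dimensions match.
    if len(out) != len(x):
        raise ValueError("sublayer output must have the same dimension as its input")
    summed = [a + b for a, b in zip(x, out)]
    return layer_norm(summed)


x = [1.0, 2.0, 3.0, 4.0]
y = residual_block(x, toy_sublayer)
```

Here `y` has the same length as `x`, which is what allows blocks of this form to be stacked one after another. A sublayer that changed the dimension would trigger the `ValueError`, mirroring the requirement that $\textrm{sublayer}(\mathbf{x}) \in \mathbb{R}^d$.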