Residual Connections and Layer Normalization in Transformers

In the Transformer architecture, a residual connection wraps each individual sublayer. For the residual addition to be well defined, a sublayer must map any input $\mathbf{x} \in \mathbb{R}^d$ to an output of the same dimension, $\textrm{sublayer}(\mathbf{x}) \in \mathbb{R}^d$, so that the sum $\mathbf{x} + \textrm{sublayer}(\mathbf{x}) \in \mathbb{R}^d$ can be computed elementwise. This residual addition is then immediately followed by a layer normalization operation.
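
The add-then-normalize step described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the names `layer_norm`, `add_and_norm`, and `toy_sublayer` are hypothetical, the learnable scale and shift parameters of layer normalization are omitted, and the toy sublayer merely stands in for an attention or feed-forward sublayer.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize x to zero mean and (approximately) unit variance over its d features.

    Assumption: the learnable scale and shift are left out for brevity.
    """
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Compute layer_norm(x + sublayer(x)) for a single d-dimensional input."""
    y = sublayer(x)
    # The residual addition is only defined when both vectors live in R^d.
    if len(y) != len(x):
        raise ValueError(
            f"sublayer changed the dimension from {len(x)} to {len(y)}"
        )
    residual_sum = [a + b for a, b in zip(x, y)]
    return layer_norm(residual_sum)


def toy_sublayer(x: Vector) -> Vector:
    """Hypothetical stand-in for attention or a feed-forward network; keeps d fixed."""
    return [2.0 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
out = add_and_norm(x, toy_sublayer)
print(len(out))  # same dimension d as the input
```

The explicit length check mirrors the dimension requirement: a sublayer that returns a vector of a different size cannot be combined with its input, so the sketch raises an error rather than silently truncating.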

Updated 2026-05-15

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

D2L

Dive into Deep Learning @ D2L