Formula

Generalized Formula for Pre-Norm Architecture

The operation within a sub-layer of a Transformer block using the pre-norm architecture is generalized by the formula: $\text{output} = F(\text{LNorm}(\text{input})) + \text{input}$. In this equation, $F$ is the sub-layer's function (e.g., self-attention or the FFN), and $\text{LNorm}$ is layer normalization. The input and output are both matrices of size $m \times d$, where $m$ is the sequence length and $d$ is the representation dimension. Each row of these matrices is the contextual representation of one token in the sequence. This structure applies normalization to the sub-layer's input before the function is applied, with the residual connection bypassing both the normalization and the function.
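As a minimal numpy sketch of this formula: the helper names (`layer_norm`, `pre_norm_sublayer`) and the toy linear map standing in for $F$ are illustrative assumptions, not code from the source.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row (one token's representation) to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_sublayer(x, F):
    # Pre-norm: output = F(LNorm(input)) + input.
    # Normalization is applied to the input, and the residual skips both
    # the normalization and the sub-layer function F.
    return F(layer_norm(x)) + x

# Toy stand-in for F (e.g., an FFN): a position-wise linear map.
rng = np.random.default_rng(0)
m, d = 4, 8                       # sequence length, representation dimension
W = rng.normal(size=(d, d)) * 0.1
F = lambda h: h @ W

x = rng.normal(size=(m, d))       # input: an m x d matrix of token representations
out = pre_norm_sublayer(x, F)
print(out.shape)                  # same m x d shape as the input
```

Note that because the residual path is an identity (nothing is normalized after the addition), the output here equals the raw input plus the transformed normalized input, which is what distinguishes pre-norm from post-norm.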

Updated 2026-04-21

Tags

Ch.2 Generative Models - Foundations of Large Language Models
