1Cademy - Layer Normalization Formula

Learn Before

Layer Normalization in Transformers

Layer Normalization Formula

A widely adopted form of the layer normalization function computes the normalized output for a $d$-dimensional real-valued vector $\mathbf{h}$ as follows:

$\mathrm{LNorm}(\mathbf{h}) = \alpha \cdot \frac{\mathbf{h} - \mu}{\sigma + \epsilon} + \beta$

Here $\mu$ and $\sigma$ are the scalar mean and standard deviation of the $d$ entries of $\mathbf{h}$, so the subtraction and division are applied to every entry. The small constant $\epsilon$ is added to the denominator for numerical stability, keeping the division well defined when $\sigma$ is close to zero. The learnable parameters $\alpha \in \mathbb{R}^{d}$ and $\beta \in \mathbb{R}^{d}$ are the gain and bias terms, applied element-wise.
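To make the formula concrete, the following is a minimal NumPy sketch for a single vector, not a reference implementation. The function name `layer_norm`, the default `eps=1e-5`, and the choice of a gain of ones and a bias of zeros are assumptions made for illustration; the formula itself does not fix them.

```python
import numpy as np

def layer_norm(h, alpha, beta, eps=1e-5):
    """Compute alpha * (h - mu) / (sigma + eps) + beta for a 1-D vector h."""
    mu = h.mean()    # scalar mean over all d entries of h
    sigma = h.std()  # scalar (population) standard deviation over all d entries
    # alpha and beta have shape (d,), so gain and bias act element-wise.
    return alpha * (h - mu) / (sigma + eps) + beta

h = np.array([1.0, 2.0, 3.0, 4.0])
alpha = np.ones_like(h)   # assumed gain: all ones
beta = np.zeros_like(h)   # assumed bias: all zeros
out = layer_norm(h, alpha, beta)
print(out)         # normalized entries
print(out.mean())  # close to 0
print(out.std())   # close to 1
```

With this gain and bias the output has mean close to 0 and standard deviation close to 1. Other values of `alpha` and `beta` rescale and shift each entry. Some libraries place the stabilizing constant inside the square root of the variance rather than adding it to $\sigma$, which gives slightly different values when $\sigma$ is very small.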

Updated 2026-04-21

References

Reference of Foundations of Large Language Models Course

Learn After

Applying the Layer Normalization Formula
Root Mean Square (RMS) Layer Normalization
In the Layer Normalization formula, $\mathrm{LNorm}(\mathbf{h}) = \alpha \cdot \frac{\mathbf{h} - \mu}{\sigma + \epsilon} + \beta$, what is the primary purpose of including the learnable gain ($\alpha$) and bias ($\beta$) parameters?
An engineer modifies the standard Layer Normalization formula, LNorm(h) = α * (h - μ) / (σ + ε) + β, by removing the mean-subtraction step (- μ). The new operation is ModifiedLNorm(h) = α * h / (σ + ε) + β. How will the output of this modified operation fundamentally differ from the output of the standard operation?
You are reviewing a teammate’s proposed Transforme...
In a transformer feed-forward block, your team is ...
You’re reviewing a PR that changes a transformer b...
You’re debugging a transformer FFN refactor where ...
Explaining a Distribution Shift Caused by Swapping LayerNorm for RMSNorm and GELU for SwiGLU
Choosing an FFN Activation and Normalization Pair Under Deployment Constraints
Diagnosing Training Instability When Changing Normalization and FFN Activations
Interpreting Activation/Normalization Interactions from FFN Telemetry
Root-Cause Analysis of FFN Output Drift After Swapping Normalization and Activation
Selecting a Normalization + FFN Activation Change After Quantization Regressions
