1Cademy - Comparison of Layer Normalization and Batch Normalization in NLP

Layer normalization works like batch normalization, except that it computes the mean and variance over the feature dimension of each example rather than over the examples in a minibatch. Two advantages follow from this difference: the output is independent of the scale of the input, and it does not depend on the batch size. Batch normalization is widely used in computer vision, but in natural language processing (NLP) tasks it is empirically less effective than layer normalization. NLP inputs are often variable-length sequences, so statistics computed per example over the feature dimension are more stable and more appropriate than statistics computed across a minibatch.
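The two independence properties can be checked with a few lines of plain Python. This is only a minimal sketch, not how any library implements normalization: the helper names (standardize, layer_norm, batch_norm, close), the value of EPS, and the example batches are all assumptions made for illustration, and the learnable scale and shift parameters are left out.

```python
from statistics import fmean, pvariance

EPS = 1e-5  # small stabilizing constant (assumed value for this sketch)

def standardize(values):
    """Shift to zero mean and scale to unit (biased) variance."""
    mu = fmean(values)
    var = pvariance(values, mu)
    return [(v - mu) / (var + EPS) ** 0.5 for v in values]

def layer_norm(batch):
    # Statistics per example: each row is standardized over its own features.
    return [standardize(row) for row in batch]

def batch_norm(batch):
    # Statistics per feature: each column is standardized over the minibatch.
    columns = [standardize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def close(a, b, tol=1e-3):
    return all(abs(x - y) <= tol for x, y in zip(a, b))

batch_a = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
batch_b = [[1.0, 2.0], [2.0, 3.0], [30.0, 40.0]]  # same first example, different companion

# Batch independence: the first example's layer-norm output ignores the other rows.
ln_same = close(layer_norm(batch_a)[0], layer_norm(batch_b)[0])
# Batch norm mixes examples, so the same input row can map to different outputs.
bn_same = close(batch_norm(batch_a)[0], batch_norm(batch_b)[0])
# Scale independence: multiplying the inputs by a constant barely changes layer norm.
scaled = [[10.0 * v for v in row] for row in batch_a]
ln_scale_same = close(layer_norm(batch_a)[0], layer_norm(scaled)[0])

print(ln_same, bn_same, ln_scale_same)
```

Under these assumptions the script prints True False True: the first example keeps its layer-normalized value whatever else is in the batch and whatever its scale, while its batch-normalized value changes as soon as one other example changes.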

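The following PyTorch snippet applies both methods to the same input. Layer normalization standardizes each row (one example across its features), whereas batch normalization standardizes each column (one feature across the examples in the batch):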

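import torch
from torch import nn

ln = nn.LayerNorm(2)       # normalizes each row over its 2 features
bn = nn.LazyBatchNorm1d()  # infers the number of features on the first call
X = torch.tensor([[1, 2], [2, 3]], dtype=torch.float32)
# Both modules start in training mode, so batch norm uses the mean and variance of X itself
print('layer norm:', ln(X), '\nbatch norm:', bn(X))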

Output:

layer norm: tensor([[-1.0000,  1.0000],
        [-1.0000,  1.0000]], grad_fn=<NativeLayerNormBackward0>)
batch norm: tensor([[-1.0000, -1.0000],
        [ 1.0000,  1.0000]], grad_fn=<NativeBatchNormBackward0>)
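The same numbers can be reproduced by hand, which makes the difference in dimension explicit. The sketch below is a check under stated assumptions rather than a description of PyTorch internals: it uses nn.BatchNorm1d(2) in place of the lazy variant, takes eps = 1e-5 (the documented default of both modules), and relies on freshly constructed modules having scale 1 and shift 0. The names ln_manual, bn_manual, ln_matches, and bn_matches are introduced here for illustration.

```python
import torch
from torch import nn

X = torch.tensor([[1, 2], [2, 3]], dtype=torch.float32)
eps = 1e-5  # default eps of nn.LayerNorm and nn.BatchNorm1d

# Layer norm: mean and biased variance over the feature dimension (dim=1).
ln_manual = (X - X.mean(dim=1, keepdim=True)) / torch.sqrt(
    X.var(dim=1, unbiased=False, keepdim=True) + eps)

# Batch norm in training mode: the same arithmetic over the batch dimension (dim=0).
bn_manual = (X - X.mean(dim=0, keepdim=True)) / torch.sqrt(
    X.var(dim=0, unbiased=False, keepdim=True) + eps)

# Freshly built modules apply scale 1 and shift 0, so the results should agree.
ln_matches = torch.allclose(ln_manual, nn.LayerNorm(2)(X), atol=1e-5)
bn_matches = torch.allclose(bn_manual, nn.BatchNorm1d(2)(X), atol=1e-5)
print(ln_matches, bn_matches)
```

Only the dim argument differs between the two manual expressions, which is the whole distinction between the two methods for a 2D input like this one.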

Updated 2026-05-15

References

Dive into Deep Learning
