Comparison of Layer Normalization and Batch Normalization in NLP

Layer normalization works like batch normalization, except that it normalizes each example across its feature dimension instead of normalizing each feature across the batch dimension. Because its statistics come from a single example, layer normalization does not depend on the batch size, and its output is approximately unchanged when the input is rescaled by a nonzero constant (scale independence). Batch normalization is widely used in computer vision, but it is empirically less effective than layer normalization in natural language processing (NLP) tasks. NLP inputs are often variable-length sequences, so statistics computed across a minibatch are less stable than statistics computed across the features of each example.
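The two independence properties can be checked without any deep learning library. The following is a minimal sketch in plain Python, not PyTorch's implementation: the helper names (`standardize`, `layer_norm`, `batch_norm`) and the value of `EPS` are assumptions made for illustration, and the learned scale and shift parameters are omitted.

```python
import math

EPS = 1e-5  # assumed small constant for numerical stability


def standardize(values, eps=EPS):
    """Subtract the mean and divide by the (biased) standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def layer_norm(batch):
    """Normalize each example (row) across its own features."""
    return [standardize(row) for row in batch]


def batch_norm(batch):
    """Normalize each feature (column) across the examples in the batch."""
    columns = [standardize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


x = [1.0, 2.0, 4.0]
y = [0.0, 5.0, -3.0]

# Scale independence: rescaling the input barely changes the layer norm output.
ln_x = layer_norm([x])[0]
ln_scaled = layer_norm([[100.0 * v for v in x]])[0]
max_gap = max(abs(a - b) for a, b in zip(ln_x, ln_scaled))
print('largest difference after rescaling:', max_gap)

# Batch size independence: the result for x is the same alone or in a batch.
print('same result in a batch:', layer_norm([x, y])[0] == ln_x)

# Batch norm with a single example has zero variance per feature,
# so every normalized value in this sketch collapses to 0.
print('batch norm, batch of one:', batch_norm([x]))
```

The small remaining difference after rescaling comes from the stability constant, which is why the property holds only approximately.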

The following PyTorch snippet applies layer normalization and batch normalization to the same input to compare the dimensions across which they normalize:

import torch
from torch import nn

ln = nn.LayerNorm(2)
bn = nn.LazyBatchNorm1d()
X = torch.tensor([[1, 2], [2, 3]], dtype=torch.float32)
# Compute mean and variance from X in the training mode
print('layer norm:', ln(X), '\nbatch norm:', bn(X))

Output:

layer norm: tensor([[-1.0000,  1.0000],
        [-1.0000,  1.0000]], grad_fn=<NativeLayerNormBackward0>)
batch norm: tensor([[-1.0000, -1.0000],
        [ 1.0000,  1.0000]], grad_fn=<NativeBatchNormBackward0>)
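These values can be checked by hand. The worked arithmetic below is a sketch that assumes the default stability constant of 10^{-5}, the biased variance, and the initial affine parameters (weight 1, bias 0). Layer normalization uses the statistics of each row of X, while batch normalization uses the statistics of each column:

```latex
\begin{align*}
\text{layer norm, row } (1, 2):\quad
  & \mu = 1.5,\quad \sigma^2 = \tfrac{(1 - 1.5)^2 + (2 - 1.5)^2}{2} = 0.25, \\
  & \frac{1 - 1.5}{\sqrt{0.25 + 10^{-5}}} \approx -1.0000,\qquad
    \frac{2 - 1.5}{\sqrt{0.25 + 10^{-5}}} \approx 1.0000 \\
\text{batch norm, column } (1, 2):\quad
  & \mu = 1.5,\quad \sigma^2 = 0.25
    \;\Rightarrow\; (-1.0000,\ 1.0000) \text{ down the column}
\end{align*}
```

The second row (2, 3) and the second column (2, 3) have mean 2.5 and the same variance, so they normalize to the same pair of values. This is why the layer norm output is identical across rows while the batch norm output is identical across columns.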

Updated 2026-05-15

Dive into Deep Learning @ D2L