1Cademy - Implementation of the AddNorm Component in Transformers

Learn Before

Residual Connections and Layer Normalization in Transformers

The AddNorm component in a Transformer implements a residual connection followed immediately by layer normalization, with dropout applied for regularization. In PyTorch, the module takes the sublayer's input tensor $X$ and the sublayer's output tensor $Y$. It applies dropout to $Y$, adds $X$ (both tensors must have the same shape for the element-wise residual addition), and passes the sum through a layer normalization module, so that it computes $\mathrm{LayerNorm}(X + \mathrm{Dropout}(Y))$:

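import torch
from torch import nn

class AddNorm(nn.Module):
    """The residual connection followed by layer normalization."""
    def __init__(self, norm_shape, dropout):
        super().__init__()
        # Dropout is applied to the sublayer output only, not to the residual input
        self.dropout = nn.Dropout(dropout)
        # norm_shape gives the trailing dimension(s) to normalize over
        self.ln = nn.LayerNorm(norm_shape)

    def forward(self, X, Y):
        # X: sublayer input, Y: sublayer output (same shape as X)
        return self.ln(self.dropout(Y) + X)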
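To make the three steps concrete, the following sketch compares the module's output with a manual computation of the same formula. It is an illustration under stated assumptions rather than part of the original node: the module is put in evaluation mode so that dropout acts as the identity, the random inputs and the seed are arbitrary, and the layer normalization parameters are left at their PyTorch defaults (scale 1, shift 0).

```python
import torch
from torch import nn

class AddNorm(nn.Module):
    """The residual connection followed by layer normalization."""
    def __init__(self, norm_shape, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm(norm_shape)

    def forward(self, X, Y):
        return self.ln(self.dropout(Y) + X)

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
add_norm = AddNorm(4, 0.5)
add_norm.eval()  # assumption: eval mode, so dropout passes Y through unchanged

X = torch.randn(2, 3, 4)  # stand-in for the sublayer input
Y = torch.randn(2, 3, 4)  # stand-in for the sublayer output
out = add_norm(X, Y)

# Manual version of the same computation over the last dimension
S = X + Y                                             # residual addition
mean = S.mean(dim=-1, keepdim=True)
var = S.var(dim=-1, unbiased=False, keepdim=True)     # LayerNorm uses the biased variance
manual = (S - mean) / torch.sqrt(var + add_norm.ln.eps)

matches = torch.allclose(out, manual, atol=1e-5)
print(matches)
```

In training mode the two results would generally differ, because dropout randomly zeroes and rescales entries of $Y$ before the addition.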

Because the residual addition is element-wise and layer normalization preserves the shape of its input, the output tensor has the same shape as the two inputs. The following PyTorch code verifies this shape consistency using d2l.check_shape, with layer normalization applied over the last dimension of size 4:

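import torch
from d2l import torch as d2l  # helper library accompanying Dive into Deep Learning

add_norm = AddNorm(4, 0.5)  # norm_shape = 4, dropout probability = 0.5
shape = (2, 3, 4)
d2l.check_shape(add_norm(torch.ones(shape), torch.ones(shape)), shape)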
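If the d2l package is not installed, the same check can be sketched in plain PyTorch with an assert, on the assumption that d2l.check_shape simply compares the tensor's shape with the expected one. The second half of the sketch is a side observation, not something stated in the original node: PyTorch broadcasts compatible shapes during addition, so an output-only shape check does not detect a $Y$ that is missing the batch dimension.

```python
import torch
from torch import nn

class AddNorm(nn.Module):
    """The residual connection followed by layer normalization."""
    def __init__(self, norm_shape, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm(norm_shape)

    def forward(self, X, Y):
        return self.ln(self.dropout(Y) + X)

add_norm = AddNorm(4, 0.5)
shape = (2, 3, 4)

# Plain-PyTorch stand-in for the d2l shape check
out = add_norm(torch.ones(shape), torch.ones(shape))
assert out.shape == shape  # torch.Size compares equal to a tuple

# Broadcasting caveat: Y of shape (3, 4) is silently expanded to (2, 3, 4)
out_b = add_norm(torch.ones(2, 3, 4), torch.ones(3, 4))
print(out_b.shape)
```

A stricter variant could compare X.shape and Y.shape inside forward before adding; whether that is worthwhile depends on how the surrounding model is built.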

Updated 2026-05-15

References

Dive into Deep Learning
