
Implementation of the AddNorm Component in Transformers

The AddNorm component in a Transformer implements a residual connection followed immediately by layer normalization, with dropout applied for regularization. In PyTorch, the module takes the original input tensor X and the sublayer's output tensor Y. It applies dropout to Y, adds X (the element-wise addition requires both tensors to have the same shape), and passes the sum through a layer normalization module:

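from torch import nn

class AddNorm(nn.Module):
    """The residual connection followed by layer normalization."""
    def __init__(self, norm_shape, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        # norm_shape: the trailing dimension(s) to normalize over
        self.ln = nn.LayerNorm(norm_shape)

    def forward(self, X, Y):
        # Dropout on the sublayer output, residual add, then layer norm
        return self.ln(self.dropout(Y) + X)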
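To make the forward pass concrete, the sketch below recomputes it by hand and compares the result with the module-based computation. It is illustrative only: the names `out_module` and `out_manual`, the random inputs, and the choice to put dropout in evaluation mode (so that it acts as the identity and the comparison is deterministic) are assumptions, not part of the original code.

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(2, 3, 4)  # original input (illustrative values)
Y = torch.randn(2, 3, 4)  # sublayer output (illustrative values)

ln = nn.LayerNorm(4)
dropout = nn.Dropout(0.5)
dropout.eval()  # in eval mode dropout is the identity

# Same expression as AddNorm.forward
out_module = ln(dropout(Y) + X)

# Manual layer normalization over the last dimension
S = Y + X
mean = S.mean(dim=-1, keepdim=True)
var = S.var(dim=-1, unbiased=False, keepdim=True)  # biased variance, as LayerNorm uses
out_manual = (S - mean) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias

# Compare the two results up to floating-point tolerance
print(torch.allclose(out_module, out_manual, atol=1e-5))
```

In training mode, dropout would zero random entries of Y and rescale the rest before the addition, so the two paths would only agree if the same dropout mask were applied to both.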

Because the residual connection adds X and Y element-wise, the two inputs must have the same shape, and the output keeps that shape: neither dropout nor layer normalization changes tensor dimensions. The following PyTorch code checks this with d2l.check_shape:

import torch
from d2l import torch as d2l

add_norm = AddNorm(4, 0.5)
shape = (2, 3, 4)
# Output shape must match the input shape
d2l.check_shape(add_norm(torch.ones(shape), torch.ones(shape)), shape)
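For readers without the d2l package installed, the same check can be sketched with plain PyTorch assertions. This version repeats the AddNorm definition so that it runs on its own; the variable names `out_train` and `out_eval` and the extra evaluation-mode check are additions for illustration, not part of the original example.

```python
import torch
from torch import nn

class AddNorm(nn.Module):
    """The residual connection followed by layer normalization."""
    def __init__(self, norm_shape, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm(norm_shape)

    def forward(self, X, Y):
        return self.ln(self.dropout(Y) + X)

add_norm = AddNorm(4, 0.5)
shape = (2, 3, 4)
X, Y = torch.ones(shape), torch.ones(shape)

# Training mode: dropout is active, but the shape is unchanged
add_norm.train()
out_train = add_norm(X, Y)
assert out_train.shape == shape

# Eval mode: dropout is the identity, so every entry of Y + X equals 2.
# Layer normalization subtracts the per-row mean, leaving all zeros.
add_norm.eval()
out_eval = add_norm(X, Y)
assert out_eval.shape == shape
assert torch.allclose(out_eval, torch.zeros(shape))
```

One caveat: PyTorch broadcasting lets some mismatched shapes through the addition without an error (for example, a Y of shape (3, 4) added to an X of shape (2, 3, 4)), so an explicit shape assertion on the inputs can be a useful guard.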


Updated 2026-05-15


Tags

D2L

Dive into Deep Learning @ D2L