Implementation of the AddNorm Component in Transformers
The AddNorm component in a Transformer implements a residual connection followed immediately by layer normalization, with dropout applied for regularization. In PyTorch, the module takes the original input tensor X and the sublayer's output tensor Y. It applies dropout to Y, adds X (both tensors must have identical shapes for the element-wise addition to be valid), and passes the sum through a layer normalization module:
import torch
from torch import nn

class AddNorm(nn.Module):
    """The residual connection followed by layer normalization."""
    def __init__(self, norm_shape, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm(norm_shape)

    def forward(self, X, Y):
        # Dropout is applied to the sublayer output Y only, then X is added
        return self.ln(self.dropout(Y) + X)
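As a rough summary of the forward pass above, the computation can be written as follows. The symbols \mu, \sigma^2, \gamma, \beta, and \epsilon do not appear in the original text and are introduced here for illustration: \mu and \sigma^2 denote the mean and (biased) variance taken over the normalized dimensions, \gamma and \beta are the learnable scale and shift of the layer normalization module, and \epsilon is a small constant for numerical stability.

```latex
\[
\mathrm{AddNorm}(X, Y) = \mathrm{LayerNorm}\bigl(X + \mathrm{Dropout}(Y)\bigr),
\qquad
\mathrm{LayerNorm}(z) = \gamma \odot \frac{z - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
\]
```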
The residual connection requires both inputs to have the same shape, so the output tensor has that shape as well. The following PyTorch code verifies this shape consistency using d2l.check_shape:
from d2l import torch as d2l

add_norm = AddNorm(4, 0.5)  # normalize over the last dimension (size 4)
shape = (2, 3, 4)
d2l.check_shape(add_norm(torch.ones(shape), torch.ones(shape)), shape)
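The same property can also be checked without the d2l helper. The following is a minimal sketch, not part of the original material: it repeats the AddNorm class so the block is self-contained, switches the module to evaluation mode (where dropout acts as the identity), and compares the output against a hand-computed layer normalization of X + Y. The names `manual`, `shape_ok`, and `matches_manual` are illustrative, and the comparison assumes the layer normalization parameters are still at their default initialization (scale 1, shift 0).

```python
import torch
from torch import nn


class AddNorm(nn.Module):
    """Residual connection followed by layer normalization."""

    def __init__(self, norm_shape, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm(norm_shape)

    def forward(self, X, Y):
        return self.ln(self.dropout(Y) + X)


torch.manual_seed(0)
add_norm = AddNorm(4, 0.5)
add_norm.eval()  # in evaluation mode, dropout is the identity

X = torch.randn(2, 3, 4)
Y = torch.randn(2, 3, 4)
out = add_norm(X, Y)

# Hand-computed layer normalization over the last dimension,
# using the biased variance and the module's own epsilon.
Z = X + Y
mean = Z.mean(dim=-1, keepdim=True)
var = Z.var(dim=-1, unbiased=False, keepdim=True)
manual = (Z - mean) / torch.sqrt(var + add_norm.ln.eps)

shape_ok = out.shape == X.shape                          # output keeps the input shape
matches_manual = torch.allclose(out, manual, atol=1e-5)  # agrees with the manual formula
print(shape_ok, matches_manual)
```

In training mode the two results would generally differ, because dropout randomly zeroes and rescales entries of Y before the addition.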
Tags
D2L
Dive into Deep Learning @ D2L