Learn Before
Residual Connections and Layer Normalization in Transformers
In the Transformer architecture, a residual connection is wrapped around each individual sublayer. To ensure the residual addition is mathematically feasible, the sublayer must map any input $\mathbf{x} \in \mathbb{R}^d$ to an output of the identical dimension, $\mathrm{sublayer}(\mathbf{x}) \in \mathbb{R}^d$, allowing the computation of $\mathbf{x} + \mathrm{sublayer}(\mathbf{x})$. This residual addition is then immediately followed by a layer normalization operation.
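
The add-then-normalize pattern described above can be illustrated with a minimal pure-Python sketch for a single vector. It is not a reference implementation: the names `layer_norm`, `add_norm`, and `toy_sublayer` are hypothetical, the learnable scale and shift that layer normalization usually carries are omitted, and dropout is left out.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize one vector to zero mean and (nearly) unit variance.

    Assumption: no learnable scale/shift parameters, to keep the sketch short.
    """
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def add_norm(x, sublayer):
    """Post-normalization residual wrapper: LayerNorm(x + sublayer(x))."""
    y = sublayer(x)
    # The residual sum is only defined when the sublayer preserves the dimension d.
    if len(y) != len(x):
        raise ValueError("sublayer must return a vector of the same dimension as x")
    return layer_norm([a + b for a, b in zip(x, y)])

# Hypothetical stand-in for attention or a feed-forward network:
# any function mapping a d-dimensional vector to a d-dimensional vector works here.
def toy_sublayer(x):
    return [2.0 * a for a in x]

out = add_norm([1.0, 2.0, 3.0, 4.0], toy_sublayer)
```

The dimension check makes the feasibility requirement explicit: a sublayer that changes the vector length cannot be wrapped in a residual connection without an extra projection.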

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
D2L
Dive into Deep Learning @ D2L
Related
A standard Transformer block processes an input sequence through two main sub-layers using a post-normalization scheme. Arrange the following operations in the correct order from start to finish for a single block.
A language model built with Transformer blocks consistently produces grammatically correct sentences, but the sentences lack contextual coherence. For instance, given the input 'The scientist carefully placed the sample under the microscope to observe its...', the model generates '...color is a vibrant shade of the car.' Which sub-layer within the Transformer blocks is most likely failing to perform its primary function, leading to this specific type of error?
Component Roles in a Transformer Block
Transformer Block Inputs and Outputs Notation
Learn After
A sub-layer in a neural network processes an input tensor. The sub-layer uses a specific architectural pattern where a residual connection and a normalization step are applied after the main sub-layer function. Arrange the following operations in the correct sequence to compute the final output of this sub-layer.
A sub-layer within a neural network processes an input x. The design specifies that the output of the sub-layer's main function, F(x), is first added to the original input x. A normalization function, Norm(·), is then applied to the result of this addition. Which of the following expressions accurately models this computation?
Analyzing Training Instability in a Network Sub-layer
Implementation of the AddNorm Component in Transformers