Learn Before
Formula for Post-Normalization in a Transformer Sub-layer
In the post-norm architecture of a Transformer sub-layer, the output is calculated using a specific formula. First, the sub-layer's function, represented by (which could be either self-attention or a feed-forward network), is applied to the input. The result, , is then added to the original input in a residual connection. Finally, Layer Normalization (LNorm) is applied to this sum. The complete formula is:

0
1
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Formula for Post-Normalization in a Transformer Sub-layer
A standard Transformer block processes an input sequence through two main sub-layers using a post-normalization scheme. Arrange the following operations in the correct order from start to finish for a single block.
A language model built with Transformer blocks consistently produces grammatically correct sentences, but the sentences lack contextual coherence. For instance, given the input 'The scientist carefully placed the sample under the microscope to observe its...', the model generates '...color is a vibrant shade of the car.' Which sub-layer within the Transformer blocks is most likely failing to perform its primary function, leading to this specific type of error?
Component Roles in a Transformer Block
Transformer Block Inputs and Outputs Notation
Learn After
A sub-layer in a neural network processes an input tensor. The sub-layer uses a specific architectural pattern where a residual connection and a normalization step are applied after the main sub-layer function. Arrange the following operations in the correct sequence to compute the final output of this sub-layer.
A sub-layer within a neural network processes an input
x. The design specifies that the output of the sub-layer's main function,F(x), is first added to the original inputx. A normalization function,Norm(·), is then applied to the result of this addition. Which of the following expressions accurately models this computation?Analyzing Training Instability in a Network Sub-layer