Learn Before
Transformer Blocks and Post-Norm Architecture
Pre-Norm Architecture in Transformers
The pre-norm architecture is an alternative design for Transformer sub-layers where Layer Normalization (LNorm) is applied to the sub-layer's input before the core function F(·), rather than after the residual connection as in the post-norm design. This approach can enhance training stability in deep networks. The operation is defined by the formula:

$$\text{output} = \text{input} + F(\text{LNorm}(\text{input}))$$

In this context, both $\text{input}$ and $\text{output}$ are matrices in $\mathbb{R}^{m \times d}$, where $m$ is the sequence length and $d$ is the representation dimension. Each row in these matrices corresponds to the contextual representation of a specific token in the sequence.

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
A transformer block showing all the layers
BERT's Core Architecture
Decoder-Only Transformer as a Language Model
Generalized Formula for Post-Norm Architecture
Pre-Norm Architecture in Transformers
Core Function F(·) in Transformer Sub-layers
Output Probability Calculation in Transformer Language Models