Learn Before
  • Transformer Blocks and Post-Norm Architecture

Pre-Norm Architecture in Transformers

The pre-norm architecture is an alternative design for Transformer sub-layers in which Layer Normalization (LNorm) is applied to the sub-layer's input before the core function $F(\cdot)$, while the residual connection carries the unnormalized input around both. This contrasts with the post-norm design, where LNorm is applied after the residual connection. Pre-norm can enhance training stability in deep networks. The operation is defined by the formula:

$$\text{output} = F(\text{LNorm}(\text{input})) + \text{input}$$

Here, both input and output are $m \times d$ matrices, where $m$ is the sequence length and $d$ is the representation dimension. Each row of these matrices is the contextual representation of a single token in the sequence.
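
To make the formula concrete, here is a minimal PyTorch sketch of a single pre-norm sub-layer (an illustrative assumption; the source contains no code, and the names `PreNormSublayer`, `d`, and the feed-forward network standing in for $F(\cdot)$ are hypothetical):

```python
import torch
import torch.nn as nn

class PreNormSublayer(nn.Module):
    """Pre-norm wrapper: output = F(LNorm(input)) + input."""

    def __init__(self, d: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d)  # LNorm over the representation dimension d
        self.sublayer = sublayer     # the core function F(.), e.g. attention or FFN

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (m, d): one d-dimensional row per token in the sequence
        return self.sublayer(self.norm(x)) + x  # residual skips both LNorm and F

# Usage: wrap a feed-forward network as the sub-layer function F
d = 512
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
block = PreNormSublayer(d, ffn)
x = torch.randn(10, d)  # m = 10 tokens
y = block(x)            # output has the same shape as x: (10, 512)
```

Because LNorm sits inside the residual branch, the identity path from input to output is left untouched, which is the property usually credited with stabilizing very deep stacks.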


Tags
  • Ch.2 Generative Models - Foundations of Large Language Models
  • Foundations of Large Language Models
  • Foundations of Large Language Models Course
  • Computing Sciences

Related
  • A transformer block showing all the layers

  • BERT's Core Architecture

  • Decoder-Only Transformer as a Language Model

  • Generalized Formula for Post-Norm Architecture

  • Core Function F(·) in Transformer Sub-layers

  • Output Probability Calculation in Transformer Language Models