Concept

Placement of Layer Normalization in transformers

In vanilla Transformers, the layer normalization (LN) layer is applied between residual blocks, an arrangement known as post-LN. An improved variant, pre-LN, places the LN layer inside the residual connection, before the attention or FFN sub-layer, with an additional LN after the final layer to control the magnitude of the final outputs. Pre-LN has been shown to eliminate the need for a learning-rate warm-up stage.
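
A minimal PyTorch sketch of the two placements (the class names PostLNBlock and PreLNBlock and the hyperparameters d_model, n_heads, d_ff are illustrative assumptions, not from the original text):

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Vanilla (post-LN) block: LN is applied after each residual add."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x, x, x)[0])  # residual add, then LN
        x = self.ln2(x + self.ffn(x))            # residual add, then LN
        return x

class PreLNBlock(nn.Module):
    """Pre-LN block: LN sits inside the residual branch, before the sub-layer."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)                 # LN before attention, inside the residual branch
        x = x + self.attn(h, h, h)[0]
        x = x + self.ffn(self.ln2(x))   # LN before FFN, inside the residual branch
        return x

# In a pre-LN model, one additional LayerNorm is applied after the final block
# to control the magnitude of the final outputs, e.g.:
#   final_ln = nn.LayerNorm(d_model)
#   y = final_ln(x)
```

Note that in the pre-LN block the residual path carries the raw input straight through, which is what stabilizes gradients and removes the need for warm-up.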


Updated 2026-04-21

Tags

Data Science

Foundations of Large Language Models

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
