1Cademy - Pre-Norm Architecture in Transformers

Learn Before

Transformer
Placement of Layer Normalization in transformers

Concept

Pre-Norm Architecture in Transformers

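In Transformer-based systems, the pre-norm architecture is a sub-layer configuration in which layer normalization is applied inside each residual block, to the input of the sub-layer (self-attention or feed-forward network), rather than after the residual addition as in the post-norm design. Each sub-layer therefore computes x + F(LN(x)), where F is the sub-layer function and LN is layer normalization, so the skip connection carries x forward without normalizing it. Because this arrangement is effective at stabilizing the training of deep networks, it serves as the structural basis for most modern Large Language Models.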
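
The placement difference can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation. The names `layer_norm`, `pre_norm_block`, `post_norm_block`, and `toy_sublayer` are hypothetical; the layer normalization here omits the learned gain and bias, and `toy_sublayer` is only a stand-in for self-attention or a feed-forward network.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (roughly) unit variance.

    Simplification: no learned gain or bias parameters.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_block(x, sublayer):
    """Pre-norm: y = x + F(LN(x)). The skip path carries x unchanged."""
    fx = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, fx)]


def post_norm_block(x, sublayer):
    """Post-norm: y = LN(x + F(x)). Normalization sits on the main path."""
    fx = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, fx)])


def toy_sublayer(x):
    # Hypothetical stand-in for attention or a feed-forward network.
    return [0.5 * v for v in x]


x = [1.0, 2.0, 4.0]
pre = pre_norm_block(x, toy_sublayer)    # input plus a normalized update
post = post_norm_block(x, toy_sublayer)  # whole sum is re-normalized
```

In the pre-norm block the input reaches the output through a pure addition, so stacking many such blocks keeps a direct identity path from the first layer to the last. In the post-norm block every layer rescales that path, which is one commonly cited reason it is harder to train at depth.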