1Cademy - Pre-Norm Architecture in Transformers

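The pre-norm architecture is an alternative to the post-norm design for Transformer sub-layers. As defined here, Layer Normalization (LNorm) is applied to the output of the sub-layer function $F(\cdot)$ (for example, self-attention or a feed-forward network) before the residual connection adds the input back. Because the input passes through the residual path without being normalized, this design tends to make training more stable in deep networks. The operation is defined by the formula:

$\text{output} = \text{LNorm}(F(\text{input})) + \text{input}$

Here, both input and output are $m \times d$ matrices, where $m$ is the sequence length and $d$ is the representation dimension. Each row of these matrices is the contextual representation of one token in the sequence.

Note that many references instead write pre-norm with the normalization applied to the sub-layer input, $\text{output} = F(\text{LNorm}(\text{input})) + \text{input}$. Both forms leave the residual path unnormalized, in contrast to post-norm, $\text{output} = \text{LNorm}(F(\text{input}) + \text{input})$.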
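As a rough illustration of the formula above, the following NumPy sketch applies a pre-norm sub-layer to an $m \times d$ input and, for contrast only, the post-norm ordering. The `feed_forward` function, the weight names `W1` and `W2`, and the sizes are hypothetical stand-ins for $F$ chosen for this example; the learnable gain and bias of Layer Normalization are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row (token) of x to zero mean and unit variance.
    Learnable gain and bias are omitted (an assumption made for brevity)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_sublayer(x, f):
    """output = LNorm(F(input)) + input, following the formula in the text."""
    return layer_norm(f(x)) + x

def post_norm_sublayer(x, f):
    """output = LNorm(F(input) + input), shown only for contrast."""
    return layer_norm(f(x) + x)

# Hypothetical stand-in for F: a small position-wise feed-forward network.
rng = np.random.default_rng(0)
m, d, d_hidden = 4, 8, 16          # sequence length, model dim, hidden dim
W1 = rng.normal(size=(d, d_hidden))
W2 = rng.normal(size=(d_hidden, d))

def feed_forward(x):
    # Two linear maps with a ReLU in between, applied to every row of x.
    return np.maximum(x @ W1, 0.0) @ W2

x = rng.normal(size=(m, d))        # one row per token
y_pre = pre_norm_sublayer(x, feed_forward)
y_post = post_norm_sublayer(x, feed_forward)

print(y_pre.shape, y_post.shape)   # both keep the m x d shape of the input
```

In the pre-norm output, subtracting the input recovers exactly the normalized sub-layer output, so the input reaches the next layer unchanged along the residual path. In the post-norm output, the sum itself is normalized, so every row has zero mean.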
