Substitutes of Layer Normalization in Transformers

Several substitutes for standard Layer Normalization have been developed to stabilize training or reduce computational cost in Transformers:

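  1. AdaNorm: A normalization technique that operates without learnable parameters. It first standardizes the input, $y = \frac{x - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the mean and standard deviation of $x$, and then dynamically scales the normalized vector: $z = C(1 - ky) \odot y$, where $C$ and $k$ are fixed (non-learned) constants.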

  2. Scaled $\ell_2$ normalization: This technique projects any $d$-dimensional input vector $\mathbf{x}$ onto a $(d - 1)$-sphere of radius $g$: $z = g \frac{\mathbf{x}}{\Vert \mathbf{x} \Vert}$, where $g$ is a learnable scalar.
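
The two formulas above can be sketched in a few lines of plain Python. This is a minimal, forward-pass-only illustration, not a reference implementation: the function names `adanorm` and `scaled_l2_norm`, the default values `C=1.0`, `k=0.1`, and `g=1.0`, and the small `eps` term added to avoid division by zero are all assumptions made for this example and do not come from the text above.

```python
import math


def adanorm(x, C=1.0, k=0.1, eps=1e-6):
    """AdaNorm sketch: z = C * (1 - k*y) * y, with y = (x - mu) / sigma.

    C, k, and eps are illustrative defaults (assumptions), not learned values.
    """
    d = len(x)
    mu = sum(x) / d                                        # mean of x
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / d)   # std of x
    y = [(v - mu) / (sigma + eps) for v in x]              # standardized input
    # Element-wise scaling of y by C * (1 - k*y); no learnable gain or bias.
    return [C * (1.0 - k * yi) * yi for yi in y]


def scaled_l2_norm(x, g=1.0, eps=1e-6):
    """Scaled l2 normalization sketch: z = g * x / ||x||.

    g would be a single learnable scalar in a real model; here it is
    just a function argument (assumption for illustration).
    """
    norm = math.sqrt(sum(v * v for v in x))                # Euclidean norm of x
    return [g * v / (norm + eps) for v in x]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    print(adanorm(x))
    print(scaled_l2_norm(x, g=2.0))
```

Whatever the input, the output of `scaled_l2_norm` has Euclidean norm approximately `g`, which is the projection onto a sphere of radius $g$ described in item 2. In `adanorm`, the factor `C * (1 - k*y)` depends on the input itself, which is what makes the scaling dynamic despite the absence of learned parameters.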

Updated 2026-06-13