
Substitutes for Layer Normalization in transformers

  1. AdaNorm, a normalization technique without learnable parameters. Given the standardized input $y = \frac{x - \mu}{\sigma}$, the output is $z = C(1 - ky) \odot y$, where $C$ and $k$ are fixed hyperparameters rather than learned weights.
  2. Scaled $l_2$ normalization. Given any $d$-dimensional input $x$, this approach projects it onto a $(d-1)$-sphere of learned radius $g$: $z = g \frac{x}{\lVert x \rVert}$, where $g$ is a learnable scalar.
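As a minimal sketch, the two alternatives above could be written in NumPy as follows. The hyperparameter defaults (`k=0.1`, `C=1.0`) and the `eps` term for numerical stability are assumptions for illustration, not values taken from this note:

```python
import numpy as np

def ada_norm(x, C=1.0, k=0.1, eps=1e-5):
    """AdaNorm: no learnable parameters; C and k are fixed hyperparameters (assumed defaults)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    y = (x - mu) / (sigma + eps)          # standardize: y = (x - mu) / sigma
    return C * (1.0 - k * y) * y          # z = C(1 - ky) ⊙ y

def scaled_l2_norm(x, g):
    """Scaled l2 normalization: project x onto a (d-1)-sphere of radius g (g is learned)."""
    return g * x / np.linalg.norm(x, axis=-1, keepdims=True)

x = np.random.randn(4, 8)                 # batch of 4 vectors, d = 8
z1 = ada_norm(x)                          # same shape as x
z2 = scaled_l2_norm(x, g=2.0)             # every row has l2 norm exactly g
```

Note that after scaled $l_2$ normalization, every output vector has norm exactly $g$, which is what "projecting onto a sphere of radius $g$" means; in a real model $g$ would be a trainable parameter.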

Updated 2022-05-26

Tags

Data Science
