Essay

Diagnosing Training Instability When Changing Normalization and FFN Activations

You are reviewing a teammate’s proposed change to your company’s in-house Transformer block used for a customer-support LLM. They want to (a) replace the standard LayerNorm with RMSNorm and (b) swap the FFN activation from GELU to SwiGLU. After the change, early training becomes less stable: the loss occasionally spikes, and the FFN activations show a persistent positive mean shift across features (measured per token) even though the overall scale seems controlled.
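
A minimal numeric sketch of the effect described above, assuming plain (unlearned) LayerNorm and RMSNorm; the function names layer_norm and rms_norm are illustrative, not from the in-house codebase:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Standard LayerNorm (gain/bias omitted for clarity): subtract the
    # per-token mean, then divide by the per-token standard deviation.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-6):
    # RMSNorm (gain omitted): divide by the root-mean-square of the features.
    # There is no mean subtraction, so any per-token offset is only rescaled.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=1.0, size=(4, 1024))  # 4 tokens, each with a +0.5 mean shift

print("input mean per token:  ", x.mean(axis=-1))
print("LayerNorm output mean: ", layer_norm(x).mean(axis=-1))  # ~0: centering removes the shift
print("RMSNorm output mean:   ", rms_norm(x).mean(axis=-1))    # still clearly positive: shift persists
```

With centering removed, the +0.5 offset survives normalization (it is only divided by the RMS), which matches the per-token mean shift observed in the FFN.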

Write an analysis that explains, using the LayerNorm formula and the defining behavior of RMSNorm, how removing mean-centering can allow a non-zero mean to persist even when the vector is rescaled. Then connect that to how the choice of FFN nonlinearity (GELU vs a gated activation like SwiGLU) can interact with this mean shift to amplify or dampen instability. Conclude with one concrete, technically justified adjustment you would recommend (e.g., where to place/parameterize normalization, whether to keep/modify bias terms, or which activation to use) and explain the tradeoff your recommendation makes.
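
For reference, one standard formulation of the components the prompt refers to, assuming the usual learned gain γ and bias β; the in-house block may parameterize these differently:

```latex
\[
\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,
\qquad
\mu = \tfrac{1}{d}\textstyle\sum_{i=1}^{d} x_i,\quad
\sigma^2 = \tfrac{1}{d}\textstyle\sum_{i=1}^{d} (x_i - \mu)^2
\]
\[
\mathrm{RMSNorm}(x) = \gamma \odot \frac{x}{\mathrm{RMS}(x)},
\qquad
\mathrm{RMS}(x) = \sqrt{\tfrac{1}{d}\textstyle\sum_{i=1}^{d} x_i^2 + \epsilon}
\]
\[
\mathrm{FFN}_{\mathrm{GELU}}(x) = \mathrm{GELU}(xW_1 + b_1)\,W_2 + b_2,
\qquad
\mathrm{FFN}_{\mathrm{SwiGLU}}(x) = \bigl(\mathrm{Swish}(xW_1) \odot xW_3\bigr)\,W_2,
\quad \mathrm{Swish}(z) = z\,\sigma(z)
\]
```

Note that RMSNorm drops both the mean subtraction and (in most implementations) the bias term, so a nonzero per-token mean in x is only divided by RMS(x), never removed; the gated SwiGLU branch then multiplies two linear projections of that uncentered input.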

Updated 2026-02-06

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
