Case Study

Root-Cause Analysis of FFN Output Drift After Swapping Normalization and Activation

You are reviewing a production LLM refactor in which the team changed two things in the Transformer block: (1) the FFN activation was changed from GELU to SwiGLU, and (2) standard LayerNorm was replaced with RMSNorm. After the change, offline eval shows a consistent degradation on tasks sensitive to subtle token-level biases (e.g., sentiment and toxicity), even though perplexity is nearly unchanged. A quick probe on a representative per-token hidden-state vector h shows that, before the FFN, the feature mean is no longer close to 0, while the overall magnitude (RMS) is similar to the pre-refactor baseline. The team also notes that with SwiGLU, the multiplicative gate sometimes strongly suppresses or amplifies channels depending on the sign and size of its pre-activation.
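
To make the two mechanisms concrete, here is a minimal sketch in plain PyTorch (illustrative shapes and values only, not the team's code) showing how a nonzero feature mean survives RMSNorm but not LayerNorm, and how that shift then feeds SwiGLU's sign-sensitive gate differently than a non-gated GELU:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8
h = torch.randn(d) + 0.5                      # per-token hidden state with a nonzero mean

# Standard LayerNorm: re-center (subtract the mean), then re-scale to unit variance.
mu = h.mean()
ln = (h - mu) / torch.sqrt(((h - mu) ** 2).mean() + 1e-5)

# RMSNorm: re-scale by the root-mean-square only; the mean shift survives.
rms = h / torch.sqrt((h ** 2).mean() + 1e-5)

print(f"input mean:     {h.mean():+.3f}")
print(f"LayerNorm mean: {ln.mean():+.3f}")    # ~0 by construction
print(f"RMSNorm mean:   {rms.mean():+.3f}")   # the shift is still there

# Feed the RMS-normalized (still mean-shifted) vector into both FFN styles.
# GELU squashes each channel smoothly and independently; SwiGLU multiplies a
# value path by a gate whose sign and magnitude track the shifted pre-activation.
W_val, W_gate = torch.randn(d, d), torch.randn(d, d)
gelu_out   = F.gelu(rms @ W_val)
swiglu_out = F.silu(rms @ W_gate) * (rms @ W_val)
print(f"GELU output mean:   {gelu_out.mean():+.3f}")
print(f"SwiGLU output mean: {swiglu_out.mean():+.3f}")
```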

As the on-call ML engineer, write a short diagnosis that (a) explains how the difference between standard LayerNorm (re-centering + re-scaling) and RMSNorm (re-scaling only) can interact with SwiGLU's multiplicative gating, as opposed to GELU's smoother, non-gated response, to produce a systematic output drift, and (b) proposes one concrete, minimal change (either to normalization parameters/placement or to the FFN activation choice) that would most directly test your hypothesis without reverting the entire refactor.
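
As an illustration of what a minimal probe for part (b) could look like (a hypothetical sketch; `CenteredRMSNorm` is an invented name, and the real codebase's norm module may differ), one option is to add LayerNorm-style mean-subtraction back into the FFN's pre-norm while keeping RMS re-scaling and SwiGLU unchanged:

```python
import torch
import torch.nn as nn

class CenteredRMSNorm(nn.Module):
    """RMSNorm plus LayerNorm-style re-centering (diagnostic probe only)."""
    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x - x.mean(dim=-1, keepdim=True)   # the single change under test
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms
```

Swapping this in only at the FFN's pre-norm (leaving the attention path untouched) isolates re-centering as the single variable: if the drift disappears, the lost mean-subtraction interacting with the gate is implicated.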
