Case Study

Interpreting Activation/Normalization Interactions from FFN Telemetry

You are on an LLM platform team reviewing a regression after a refactor of the Transformer feed-forward network (FFN). The refactor changed two things at once: (1) the FFN activation, from GELU to a gated variant (SwiGLU), and (2) the normalization layer, from standard LayerNorm to RMSNorm. No other hyperparameters were intentionally changed.

You have the following telemetry collected on the same held-out batch, measured at the input to the FFN (right after normalization) and at the FFN output (right before the residual add); a sketch of how such statistics could be captured appears after the two lists:

Before refactor (LayerNorm + GELU):

  • Normalized FFN input: per-token feature mean ≈ 0.00, per-token feature std ≈ 1.00
  • FFN output: ~48% of elements are negative; output mean ≈ 0.00

After refactor (RMSNorm + SwiGLU):

  • Normalized FFN input: per-token feature mean ≈ +0.35, per-token RMS ≈ 1.00
  • FFN output: ~8% of elements are negative; output mean ≈ +0.60
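
For reference, here is a minimal sketch of how per-token statistics like these could be collected with PyTorch forward hooks. The module paths (block.ffn_norm, block.ffn) are hypothetical placeholders for wherever the normalization and FFN modules live in the actual codebase:

    import torch

    def ffn_telemetry(block, batch):
        # Capture per-token stats at the FFN input (post-norm)
        # and at the FFN output (pre-residual-add).
        stats = {}

        def norm_hook(_mod, _inp, out):
            # Per-token feature mean, std, and RMS of the normalized FFN input.
            stats["input_mean"] = out.mean(dim=-1).mean().item()
            stats["input_std"] = out.std(dim=-1).mean().item()
            stats["input_rms"] = out.pow(2).mean(dim=-1).sqrt().mean().item()

        def ffn_hook(_mod, _inp, out):
            # Sign balance and mean of the FFN output before the residual add.
            stats["frac_negative"] = (out < 0).float().mean().item()
            stats["output_mean"] = out.mean().item()

        h1 = block.ffn_norm.register_forward_hook(norm_hook)  # hypothetical path
        h2 = block.ffn.register_forward_hook(ffn_hook)        # hypothetical path
        with torch.no_grad():
            block(batch)
        h1.remove()
        h2.remove()
        return stats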

Assume RMSNorm is implemented as y = α * h / (rms(h) + ε) + β with rms(h) = sqrt(mean(h²)) (note: no mean subtraction), and standard LayerNorm as y = α * (h − μ) / (σ + ε) + β. Also assume SwiGLU is implemented as swish(hW1 + b1) ⊙ (hW2 + b2), where swish(x) = x · sigmoid(x).
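
To make these assumptions concrete, here is a minimal sketch of the two normalizers and the gated FFN, written directly from the formulas above; the tensor shapes and the ε placement are taken as stated in the prompt, and only the standard swish/SiLU identity is added:

    import torch
    import torch.nn.functional as F

    def rms_norm(h, alpha, beta, eps=1e-6):
        # y = α * h / (rms(h) + ε) + β — no mean subtraction, so a nonzero
        # per-token feature mean in h survives normalization.
        rms = h.pow(2).mean(dim=-1, keepdim=True).sqrt()
        return alpha * h / (rms + eps) + beta

    def layer_norm(h, alpha, beta, eps=1e-6):
        # y = α * (h − μ) / (σ + ε) + β — mean-centering removes the per-token
        # feature mean; at the usual init (α = 1, β = 0) the output mean is ≈ 0.
        mu = h.mean(dim=-1, keepdim=True)
        sigma = h.std(dim=-1, keepdim=True)
        return alpha * (h - mu) / (sigma + eps) + beta

    def swiglu(h, W1, b1, W2, b2):
        # swish(hW1 + b1) ⊙ (hW2 + b2); F.silu(x) = x * sigmoid(x) = swish(x).
        return F.silu(h @ W1 + b1) * (h @ W2 + b2)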

As the reviewer, write a concise root-cause analysis that explains how the combination of (a) removing mean-centering in normalization and (b) switching from GELU to a gated activation could plausibly produce the observed shift toward positive FFN outputs. Your answer must explicitly connect the normalization formulas to the gating behavior (element-wise product) and explain why the effect is directional (more positive) rather than just a change in scale.
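
One way to sanity-check the directional claim before writing the analysis: the toy simulation below injects a shared positive shift δ into both SwiGLU branch pre-activations and measures the sign balance of the gated product. The premise that the uncentered input mean leaks into both branches as a uniform offset, and the δ values themselves, are illustrative simplifications rather than measurements from the model; the point is the direction of the effect, not its magnitude:

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)

    def gated_stats(delta, n=1_000_000):
        # u, v stand in for the two branch pre-activations hW1 + b1 and
        # hW2 + b2, sharing a positive shift δ (an illustrative assumption).
        u = torch.randn(n) + delta
        v = torch.randn(n) + delta
        out = F.silu(u) * v  # swish(u) ⊙ v
        return (out < 0).float().mean().item(), out.mean().item()

    for delta in (0.0, 0.35, 1.0):
        frac_neg, mean = gated_stats(delta)
        print(f"delta={delta:+.2f}  frac_negative={frac_neg:6.2%}  mean={mean:+.3f}")

    # swish(u) has the sign of u, so the product is negative exactly when the
    # two branches disagree in sign. With δ = 0 that happens ~50% of the time
    # and the mean is ~0; as δ grows, both branches agree (positive) more
    # often, so the negative fraction falls and the mean rises — a directional
    # shift, not just a rescaling.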
