Case Study

Selecting a Normalization + FFN Activation Change After Quantization Regressions

You are leading an LLM inference optimization effort. After switching the model to 8-bit weight-only quantization, you observe a consistent quality regression that correlates with occasional activation outliers in the feed-forward network (FFN) blocks. You are allowed to change (A) the FFN activation (currently GELU) and/or (B) the normalization used before the FFN (currently standard LayerNorm). You must keep the model’s parameter count roughly the same and you cannot add extra normalization layers.

Standard LayerNorm is: LNorm(h) = α * (h − μ) / (σ + ε) + β, where μ and σ are the mean and standard deviation computed over the features of h.

RMSNorm is: RMSNorm(h) = α * h / (rms(h) + ε) + β, where rms(h) = sqrt((1/d) * Σ_k h_k^2) over the d features; unlike LayerNorm, it does NOT subtract the mean.
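To make the difference concrete, here is a minimal NumPy sketch (illustrative only, not part of the original materials) that applies both formulas to a vector whose feature mean is strongly positive, matching the diagnostic described below:

```python
# Minimal NumPy sketch (illustrative only): apply both normalizations to a
# vector whose feature mean is strongly positive.
import numpy as np

def layer_norm(h, alpha=1.0, beta=0.0, eps=1e-5):
    # Subtracts the per-token feature mean, so a large positive shift in h
    # is removed before the FFN ever sees it.
    mu, sigma = h.mean(), h.std()
    return alpha * (h - mu) / (sigma + eps) + beta

def rms_norm(h, alpha=1.0, beta=0.0, eps=1e-5):
    # Rescales by the root-mean-square only; the positive shift survives
    # (rescaled) because the mean is never subtracted.
    rms = np.sqrt(np.mean(h ** 2))
    return alpha * h / (rms + eps) + beta

h = np.random.normal(loc=3.0, scale=1.0, size=1024)   # mean >> 0, moderate spread
print("feature mean after LayerNorm:", layer_norm(h).mean())  # approximately 0
print("feature mean after RMSNorm:  ", rms_norm(h).mean())    # still clearly > 0
```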

SwiGLU is: SwiGLU(h) = Swish(hW1 + b1) ⊙ (hW2 + b2), i.e., a gated FFN variant in which the gate uses the Swish nonlinearity, Swish(x) = x · sigmoid(x).
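For reference, a minimal NumPy sketch of the two FFN variants (shapes, scales, and the roughly-2/3 hidden-width convention are assumptions; biases are omitted for brevity):

```python
# Minimal NumPy sketch of a GELU FFN block versus the gated SwiGLU variant.
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def swish(x):
    # Swish(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
h = rng.normal(size=(4, d_model))                      # a few token vectors

# Standard GELU FFN: up-project, apply GELU, down-project.
W_in = 0.02 * rng.normal(size=(d_model, d_ff))
W_out = 0.02 * rng.normal(size=(d_ff, d_model))
ffn_gelu = gelu(h @ W_in) @ W_out

# SwiGLU FFN: separate gate (W1) and value (W2) projections; the hidden width
# is commonly shrunk to about 2/3 of d_ff to keep the parameter count similar.
d_gated = 2 * d_ff // 3
W1 = 0.02 * rng.normal(size=(d_model, d_gated))
W2 = 0.02 * rng.normal(size=(d_model, d_gated))
W3 = 0.02 * rng.normal(size=(d_gated, d_model))
ffn_swiglu = (swish(h @ W1) * (h @ W2)) @ W3
```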

Case study: In a diagnostic run, you log the pre-FFN hidden states (the inputs to the pre-FFN normalization) for a problematic layer and find that many tokens have a large positive mean across features (μ is strongly > 0), while their per-feature spread is moderate. The quantization team reports that the worst outliers appear after the FFN nonlinearity, not before it.
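One way such a diagnostic could be reproduced is sketched below; the function name, tensor shapes, and stand-in data are hypothetical and only illustrate the kind of per-token statistics being reported:

```python
# Hypothetical diagnostic sketch: log each token's feature mean/spread at the
# FFN input and the largest-magnitude value after the FFN nonlinearity.
import numpy as np

def log_ffn_outliers(pre_ffn_hidden, post_activation):
    # pre_ffn_hidden:  (tokens, d_model) hidden states entering the FFN norm
    # post_activation: (tokens, d_ff) values after the FFN nonlinearity
    mu = pre_ffn_hidden.mean(axis=-1)
    sigma = pre_ffn_hidden.std(axis=-1)
    max_abs = np.abs(post_activation).max(axis=-1)      # simple outlier proxy
    for t in range(pre_ffn_hidden.shape[0]):
        print(f"token {t}: mu={mu[t]:+.2f} sigma={sigma[t]:.2f} "
              f"post-act max|x|={max_abs[t]:.2f}")

rng = np.random.default_rng(0)
pre = rng.normal(loc=3.0, scale=1.0, size=(3, 16))      # stand-in hidden states
post = np.maximum(pre @ rng.normal(size=(16, 32)), 0)   # stand-in post-activation
log_ffn_outliers(pre, post)
```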

Which single change is the most defensible first experiment to reduce post-activation outliers while preserving model quality, and why? Choose one option and justify it by explicitly linking (i) the mean-subtraction vs. no-mean-subtraction behavior of the two normalization formulas and (ii) the gating/smoothness behavior of GELU vs. SwiGLU, in terms of how each can amplify or dampen large positive shifts.

Options:

  1. Keep LayerNorm, switch GELU → SwiGLU
  2. Switch LayerNorm → RMSNorm, keep GELU
  3. Switch LayerNorm → RMSNorm and switch GELU → SwiGLU
  4. Keep LayerNorm and keep GELU (change nothing)
