Selecting a Normalization + FFN Activation Change After Quantization Regressions
You are leading an LLM inference optimization effort. After switching the model to 8-bit weight-only quantization, you observe a consistent quality regression that correlates with occasional activation outliers in the feed-forward network (FFN) blocks. You are allowed to change (A) the FFN activation (currently GELU) and/or (B) the normalization used before the FFN (currently standard LayerNorm). You must keep the model’s parameter count roughly the same and you cannot add extra normalization layers.
Standard LayerNorm is: LNorm(h) = α * (h − μ) / (σ + ε) + β, where μ and σ are the mean and standard deviation over features of h.
RMSNorm is: RMSNorm(h) = α * h / (rms(h) + ε) + β, where rms(h) = sqrt((1/d) * Σ_k h_k^2) and it does NOT subtract the mean.
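To make the mean-subtraction vs no-mean-subtraction distinction concrete, here is a minimal NumPy sketch (illustrative only; the dimension, ε, and toy vector are assumptions, not values from the model) applying both formulas to a token vector whose feature mean is strongly positive but whose spread is moderate:

```python
import numpy as np

# Illustrative toy vector: feature mean = 3.0, std ~ 0.19 (assumed values, not model telemetry)
h = np.array([2.9, 3.1, 3.0, 2.8, 3.2, 3.0, 2.7, 3.3])

def layer_norm(h, alpha=1.0, beta=0.0, eps=1e-6):
    mu, sigma = h.mean(), h.std()
    return alpha * (h - mu) / (sigma + eps) + beta   # mean is subtracted: output is re-centred at 0

def rms_norm(h, alpha=1.0, beta=0.0, eps=1e-6):
    rms = np.sqrt(np.mean(h ** 2))
    return alpha * h / (rms + eps) + beta            # no mean subtraction: the positive shift survives

print("LayerNorm output mean:", layer_norm(h).mean())  # ~0.0
print("RMSNorm  output mean:", rms_norm(h).mean())     # ~1.0, still clearly positive
```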
SwiGLU is: swish(hW1 + b1) ⊙ (hW2 + b2), where ⊙ denotes element-wise multiplication, i.e., a gated FFN variant in which a Swish-activated branch gates a parallel linear branch.
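Similarly, a small sketch of the two FFN nonlinearities (illustrative only: toy sizes, random weights, the tanh approximation of GELU, and no output down-projection are all assumptions) shows how the Swish-gated product differs structurally from a plain GELU branch:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 8, 16                                         # assumed toy dimensions
W1, b1 = rng.normal(0.0, 0.1, (d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(0.0, 0.1, (d, d_ff)), np.zeros(d_ff)

def gelu(x):
    # tanh approximation of GELU: ~x for large positive x, ~0 for large negative x
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def swish(x):
    # Swish / SiLU: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def ffn_gelu(h):
    # conventional FFN branch: single up-projection followed by GELU
    return gelu(h @ W1 + b1)

def ffn_swiglu(h):
    # gated FFN branch: Swish-activated branch multiplies a parallel linear branch element-wise
    return swish(h @ W1 + b1) * (h @ W2 + b2)

print(gelu(np.array([-3.0, 3.0])))   # ~[-0.004, 2.996]: large positive inputs pass through almost unchanged
print(swish(np.array([-3.0, 3.0])))  # ~[-0.142, 2.857]
```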
Case study: In a diagnostic run, you log the pre-FFN normalized vectors for a problematic layer and find that many tokens have a large positive mean across features (μ is strongly > 0), while their per-feature spread is moderate. The quantization team reports that the worst outliers appear after the FFN nonlinearity, not before it.
Which single change is the most defensible first experiment to reduce post-activation outliers while preserving model quality, and why? Choose one option and justify it by explicitly linking (i) the mean-subtraction vs no-mean-subtraction behavior of the normalization formula and (ii) the gating/smoothness behavior of GELU vs SwiGLU in how they can amplify or dampen large positive shifts.
Options:
- Keep LayerNorm, switch GELU → SwiGLU
- Switch LayerNorm → RMSNorm, keep GELU
- Switch LayerNorm → RMSNorm and switch GELU → SwiGLU
- Keep LayerNorm and keep GELU (change nothing)
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
GELU (Gaussian Error Linear Unit) Formula
Applications of GELU in Large Language Models
An activation function is defined by its behavior of weighting an input value by that value's corresponding cumulative probability from a standard normal distribution (mean = 0, variance = 1). Given two inputs, x = −3 and y = 3, which statement best describes their respective outputs, f(x) and f(y)?
Hendrycks and Gimpel [2016] on GELU
An activation function is designed to scale its input value by the probability that a randomly drawn value from a standard normal distribution (mean=0, variance=1) is less than or equal to that input. How does this function's output for a small negative input (e.g., -0.1) compare to the output of a function that simply sets all negative inputs to zero?
Activation Function Selection for a Language Model
Diagnosing Training Instability When Changing Normalization and FFN Activations
Choosing an FFN Activation and Normalization Pair Under Deployment Constraints
Explaining a Distribution Shift Caused by Swapping LayerNorm for RMSNorm and GELU for SwiGLU
Root-Cause Analysis of FFN Output Drift After Swapping Normalization and Activation
Selecting a Normalization + FFN Activation Change After Quantization Regressions
Interpreting Activation/Normalization Interactions from FFN Telemetry
SwiGLU (Swish-based Gated Linear Unit) Formula
Applications of SwiGLU in Large Language Models
The family of Gated Linear Unit (GLU) activation functions creates different variants by incorporating a specific non-linear function to control an information 'gate'. Based on this principle, what is the key distinguishing feature of the SwiGLU variant compared to other possible variants in the same family?
Deconstructing the SwiGLU Activation Function
The gating component of the SwiGLU activation function is controlled by a non-linear function that is strictly increasing across its entire domain.
RMS Layer Normalization Formula
Root Mean Square (RMS) of a Vector
An input vector to a neural network layer consists of elements that are all large positive values. This vector is processed by two different normalization techniques. Technique A first calculates the average of the elements and subtracts it from each element, then scales the result. Technique B bypasses the subtraction step and only scales the elements based on their root mean square magnitude. Which statement best describes the fundamental difference between the output vectors produced by these two techniques?
Comparing Normalization Procedure Outcomes
True or False: A normalization technique that operates by dividing each element of an input vector by the vector's root mean square (without first subtracting the mean) guarantees that the resulting output vector will have a mean of zero.
Applying the Layer Normalization Formula
Root Mean Square (RMS) Layer Normalization
In the Layer Normalization formula, what is the primary purpose of including the learnable gain (α) and bias (β) parameters?
An engineer modifies the standard Layer Normalization formula, LNorm(h) = α * (h − μ) / (σ + ε) + β, by removing the mean-subtraction step (− μ). The new operation is ModifiedLNorm(h) = α * h / (σ + ε) + β. How will the output of this modified operation fundamentally differ from the output of the standard operation?