Case Study

Incident Review: Silent Performance Regression After “Optimization” of a Transformer Block

You are reviewing a production incident in an internal LLM service. A teammate made an “optimization” to a Transformer block to reduce compute and simplify code. After the change, training no longer diverges, but model quality drops noticeably (worse long-context retrieval and weaker instruction following) while throughput improves. You are given the following implementation notes for one block (sequence length m, model width d):

  • The block has two sub-layers in order: (1) multi-head self-attention, (2) a 2-layer FFN.
  • The teammate changed the FFN hidden size from d_h = 4d to d_h = d.
  • They also changed normalization placement from post-norm to pre-norm, but their code now does:
    1. y = x + Attention(LNorm(x))
    2. z = y + FFN(LNorm(y))
    3. output = LNorm(z)
  • They claim this is “equivalent but faster” because the FFN is narrower and “extra LNorm at the end keeps things stable.”

As the reviewer, identify the two most likely root causes of the quality regression that follow from the interaction of (a) multi-head self-attention's role, (b) the FFN's dimensionality/structure, and (c) layer normalization placement (pre-norm vs. post-norm). Then propose one concrete code-level correction (in words or pseudocode) that would address the regression while keeping training stable, and justify why it helps.
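To make the implementation notes concrete, here is a minimal NumPy sketch of the block exactly as the teammate wired it (steps 1-3 above). All names are hypothetical; single-head attention stands in for multi-head for brevity, and the FFN hidden size is the narrowed d_h = d from the notes.

```python
import numpy as np

def lnorm(x, eps=1e-5):
    # Layer normalization over the model width (last axis).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, Wq, Wk, Wv, Wo):
    # Single-head self-attention (the real block is multi-head).
    d = Wq.shape[1]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(d))
    return (scores @ v) @ Wo

def ffn(x, W1, W2):
    # Teammate's narrowed 2-layer FFN: hidden size d_h = d (was 4d).
    return np.maximum(x @ W1, 0.0) @ W2

def block(x, params):
    y = x + attention(lnorm(x), *params["attn"])   # step 1: pre-norm attention sub-layer
    z = y + ffn(lnorm(y), *params["ffn"])          # step 2: pre-norm FFN sub-layer
    return lnorm(z)                                # step 3: extra LNorm on the block output

rng = np.random.default_rng(0)
m, d = 8, 16  # sequence length m, model width d
params = {
    "attn": tuple(rng.normal(0.0, 0.1, (d, d)) for _ in range(4)),
    "ffn": (rng.normal(0.0, 0.1, (d, d)), rng.normal(0.0, 0.1, (d, d))),
}
x = rng.normal(size=(m, d))
out = block(x, params)
print(out.shape)
```

Note that this sketch reproduces the wiring under review, not a corrected block; in particular the narrowed FFN and the trailing `lnorm` on the residual stream are the choices the case study asks you to evaluate.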

Updated 2026-02-06
