Essay

Diagnosing a Transformer Block Refactor: Attention/FFN Shapes and Norm Placement

You are reviewing a teammate’s refactor of a Transformer block used in an internal LLM service. The block should contain (1) a multi-head self-attention sub-layer and (2) a two-layer position-wise FFN, each with a residual connection and layer normalization. The model uses sequence length m and model width d, and the attention module uses h heads.

During integration testing, you see two symptoms: (A) training becomes unstable when stacking many blocks (loss spikes and occasional divergence), and (B) a shape error appears in the FFN path when d=512 and the hidden width is set to d_h=2048.
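For concreteness, here is one plausible way symptom (B) could arise; the refactored code is not reproduced in this prompt, so treat this as an assumed shape bug for illustration, not the confirmed cause. It uses PyTorch-style Python and hypothetical variable names:

```python
# Hypothetical reproduction of symptom (B); one plausible bug, not the confirmed cause.
import torch

m, d, d_h = 10, 512, 2048          # sequence length, model width, FFN hidden width
H = torch.randn(m, d)              # sub-layer input, shape (m, d)

W_h = torch.randn(d, d_h)          # first FFN weight: (d, d_h), so the hidden activation is (m, d_h)
W_f_buggy = torch.randn(d, d)      # buggy second weight: should be (d_h, d)

hidden = torch.relu(H @ W_h)       # (m, d) @ (d, d_h) = (m, d_h)
try:
    out = hidden @ W_f_buggy       # (m, 2048) @ (512, 512) -> size mismatch
except RuntimeError as e:
    print("shape error:", e)       # e.g. "mat1 and mat2 shapes cannot be multiplied (10x2048 and 512x512)"
```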

Write a technical diagnosis that:

  1. Identifies the most likely normalization/residual ordering mistake that would explain symptom (A), explicitly contrasting pre-norm vs post-norm computation for a sub-layer and why one tends to be more stable in very deep stacks (a minimal ordering sketch appears right after this list).
  2. Explains, using correct tensor/matrix dimensions, how multi-head self-attention can take an input H ∈ R^{m×d} and still return an output in R^{m×d} (include what happens across heads conceptually, and why concatenation must be followed by an output projection).
  3. Pinpoints the FFN dimension mismatch that could cause symptom (B) and states the required shapes of W_h, b_h, W_f, and b_f so that the FFN maps an input of width d back to width d (see the full-block shape sketch after the closing paragraph).
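The following is a minimal sketch of the two orderings for a single generic sub-layer, written in PyTorch-style Python purely for illustration; the `sublayer` stand-in and variable names are assumptions, not the service's actual modules:

```python
# Illustrative contrast of post-norm vs pre-norm for one sub-layer.
import torch
import torch.nn as nn

d = 512
norm = nn.LayerNorm(d)
sublayer = nn.Linear(d, d)          # stand-in for either attention or the FFN

def post_norm(x):
    # Original Transformer ordering: normalize *after* adding the residual.
    # The residual path itself passes through LayerNorm in every block, which is
    # one reason very deep post-norm stacks are harder to train without careful warmup.
    return norm(x + sublayer(x))

def pre_norm(x):
    # Pre-norm ordering: normalize the sub-layer input and keep the residual path
    # an identity, so gradients flow through the skip connections unimpeded;
    # this tends to stabilize very deep stacks.
    return x + sublayer(norm(x))

x = torch.randn(10, d)                         # (m, d)
print(post_norm(x).shape, pre_norm(x).shape)   # both torch.Size([10, 512])
```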

Your answer should connect these three parts into a single coherent explanation of how a correct Transformer block maintains consistent dimensions while using residuals and layer normalization, and how an incorrect norm placement can interact with deep stacking to destabilize training.
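As a reference point for the diagnosis, here is a minimal end-to-end sketch of a correct pre-norm block, again assuming PyTorch; the dimension comments trace how both sub-layers map an (m, d) input back to (m, d) so every residual addition is well-defined. Module and variable names are illustrative assumptions, not the internal service's code, and the shape comments follow the H·W convention used in the prompt (nn.Linear stores the transposed weight internally):

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Illustrative pre-norm Transformer block; shapes are the point, not performance."""
    def __init__(self, d=512, h=8, d_h=2048):
        super().__init__()
        assert d % h == 0
        self.d, self.h, self.d_k = d, h, d // h
        # Per-head query/key/value projections fused into single d x d maps.
        self.W_q = nn.Linear(d, d, bias=False)
        self.W_k = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)
        self.W_o = nn.Linear(d, d, bias=False)   # output projection applied after concatenating heads
        # FFN in the prompt's notation: W_h in R^{d x d_h}, b_h in R^{d_h}, W_f in R^{d_h x d}, b_f in R^{d}.
        self.ffn = nn.Sequential(nn.Linear(d, d_h), nn.ReLU(), nn.Linear(d_h, d))
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)

    def attention(self, H):
        m, d = H.shape                                       # (m, d)
        def split(x):                                        # (m, d) -> (h, m, d_k)
            return x.view(m, self.h, self.d_k).transpose(0, 1)
        Q, K, V = split(self.W_q(H)), split(self.W_k(H)), split(self.W_v(H))
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5   # (h, m, m)
        A = torch.softmax(scores, dim=-1) @ V                # (h, m, d_k): one d_k-wide output per head
        concat = A.transpose(0, 1).reshape(m, d)             # concatenate heads -> (m, h*d_k) = (m, d)
        return self.W_o(concat)                              # project to mix heads -> (m, d)

    def forward(self, H):
        H = H + self.attention(self.norm1(H))   # residual 1: (m, d) + (m, d)
        H = H + self.ffn(self.norm2(H))         # residual 2: (m, d) + (m, d)
        return H

block = PreNormBlock(d=512, h=8, d_h=2048)
H = torch.randn(10, 512)                         # m = 10, d = 512
print(block(H).shape)                            # torch.Size([10, 512])
```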


Updated 2026-02-06


Tags

Data Science

Ch.1 Pre-training - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Ch.2 Generative Models - Foundations of Large Language Models

Transformer
