Short Answer

Design a Transformer Block Spec for a New Internal LLM Library (Shapes + Norm Placement)

You are writing a one-page implementation spec for a new internal Transformer-block API. The spec must be unambiguous enough for two different teams (training and inference) to implement it independently and still produce identical tensor shapes and computation order.

Constraints:

  • The block input is H ∈ R^{m×d} (m = sequence length, d = model width).
  • The block contains exactly two sub-layers in this order: (1) multi-head self-attention, (2) a 2-layer position-wise FFN.
  • Multi-head attention uses n_head heads with per-head dimension d_k such that concatenation returns to width d.
  • The FFN must expand to hidden width d_h and return to width d using the standard formula FFN(h) = σ(hW_h + b_h)W_f + b_f (a shape-check sketch follows this list).
  • You must choose either a pre-norm or post-norm scheme and specify precisely where LayerNorm is applied relative to F(·) and the residual addition for BOTH sub-layers.
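The constraints above pin down every shape except where the residuals and LayerNorms sit. As a sanity check that the constraint set is shape-consistent, here is a minimal numpy sketch of the two sub-layers with toy sizes. All concrete sizes and variable names are illustrative only (they are not part of the spec), σ is taken as ReLU, and LayerNorm and the residual additions are deliberately omitted so that only shape compatibility is exercised:

```python
import numpy as np

# Toy sizes for illustration only; the spec itself stays symbolic. d = n_head * d_k.
m, d, n_head, d_k, d_h = 8, 16, 4, 4, 64
rng = np.random.default_rng(0)

H   = rng.standard_normal((m, d))
W_q = rng.standard_normal((d, d))        # packed heads: d -> n_head * d_k = d
W_k = rng.standard_normal((d, d))
W_v = rng.standard_normal((d, d))
W_o = rng.standard_normal((d, d))        # concat of n_head * d_k -> d
W_h = rng.standard_normal((d, d_h))      # FFN expansion: d -> d_h
W_f = rng.standard_normal((d_h, d))      # FFN contraction: d_h -> d
b_h = np.zeros(d_h)
b_f = np.zeros(d)

def split_heads(X):                      # (m, d) -> (n_head, m, d_k)
    return X.reshape(m, n_head, d_k).transpose(1, 0, 2)

Q, K, V = split_heads(H @ W_q), split_heads(H @ W_k), split_heads(H @ W_v)
scores  = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)          # (n_head, m, m)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)                   # row-wise softmax
heads   = weights @ V                                       # (n_head, m, d_k)
attn    = heads.transpose(1, 0, 2).reshape(m, d) @ W_o      # (m, d)

ffn = np.maximum(0.0, attn @ W_h + b_h) @ W_f + b_f         # (m, d); sigma = ReLU here

# Both sub-layer outputs return to (m, d), so residual additions with H are shape-compatible.
assert attn.shape == (m, d) and ffn.shape == (m, d)
```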

Create the spec by writing:

  1. A step-by-step computation graph (as numbered equations) for the full block from input H to output H_out, including residual connections and LayerNorm placement.
  2. The required matrix dimensions for W_q, W_k, W_v, the output projection W_o, and the FFN matrices W_h and W_f (use d, d_h, n_head, d_k; you may assume d = n_head·d_k).

Your answer must be internally consistent: every addition must be shape-compatible, and your norm placement must match the scheme you chose.
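To illustrate what "internally consistent" means here, the sketch below shows one admissible scheme (pre-norm) as code, with the step numbering in comments. It is only an example, not the required answer: a post-norm spec would instead compute LayerNorm(H + F(H)) at each sub-layer, and the toy layer norm (no learned gain/bias) and the stand-in callables are assumptions made purely for the sketch:

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    # Per-row LayerNorm; learned gain and bias omitted in this sketch.
    mu  = X.mean(-1, keepdims=True)
    var = X.var(-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def block_pre_norm(H, attention, ffn):
    # (1) H1    = H  + Attention(LayerNorm(H))   -- norm inside the residual branch
    H1 = H + attention(layer_norm(H))
    # (2) H_out = H1 + FFN(LayerNorm(H1))
    H_out = H1 + ffn(layer_norm(H1))
    return H_out                                 # shape (m, d), same as the input

# Shape-only usage example with stand-in sub-layers (identity maps).
m, d = 8, 16
H = np.ones((m, d))
out = block_pre_norm(H, attention=lambda X: X, ffn=lambda X: X)
assert out.shape == (m, d)
```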
