Short Answer

Design a Transformer Block Spec for a New Internal LLM Library (Shapes + Norm Placement)

You are writing a one-page implementation spec for a new internal Transformer-block API. The spec must be unambiguous enough for two different teams (training and inference) to implement it independently and still produce identical tensor shapes and computation order.

Constraints:

  • The block input is H ∈ R^{m×d} (m = sequence length, d = model width).
  • The block contains exactly two sub-layers in this order: (1) multi-head self-attention, (2) a 2-layer position-wise FFN.
  • Multi-head attention uses n_head heads with per-head dimension d_k such that concatenation returns to width d.
  • The FFN must expand to hidden width d_h and return to width d using the standard formula FFN(h) = σ(hW_h + b_h)W_f + b_f (a shape-check sketch follows this list).
  • You must choose either a pre-norm or post-norm scheme and specify precisely where LayerNorm is applied relative to F(·) and the residual addition for BOTH sub-layers.
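The constraints above pin down every shape except where the residuals and LayerNorms sit. As a sanity check that the constraint set is shape-consistent, here is a minimal numpy sketch of the two sub-layers with toy sizes. All concrete sizes and variable names are illustrative only (they are not part of the spec), σ is taken as ReLU, and LayerNorm and the residual additions are deliberately omitted so that only shape compatibility is exercised:

```python
import numpy as np

# Toy sizes for illustration only; the spec itself stays symbolic. d = n_head * d_k.
m, d, n_head, d_k, d_h = 8, 16, 4, 4, 64
rng = np.random.default_rng(0)

H   = rng.standard_normal((m, d))
W_q = rng.standard_normal((d, d))        # packed heads: d -> n_head * d_k = d
W_k = rng.standard_normal((d, d))
W_v = rng.standard_normal((d, d))
W_o = rng.standard_normal((d, d))        # concat of n_head * d_k -> d
W_h = rng.standard_normal((d, d_h))      # FFN expansion: d -> d_h
W_f = rng.standard_normal((d_h, d))      # FFN contraction: d_h -> d
b_h = np.zeros(d_h)
b_f = np.zeros(d)

def split_heads(X):                      # (m, d) -> (n_head, m, d_k)
    return X.reshape(m, n_head, d_k).transpose(1, 0, 2)

Q, K, V = split_heads(H @ W_q), split_heads(H @ W_k), split_heads(H @ W_v)
scores  = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)          # (n_head, m, m)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)                   # row-wise softmax
heads   = weights @ V                                       # (n_head, m, d_k)
attn    = heads.transpose(1, 0, 2).reshape(m, d) @ W_o      # (m, d)

ffn = np.maximum(0.0, attn @ W_h + b_h) @ W_f + b_f         # (m, d); sigma = ReLU here

# Both sub-layer outputs return to (m, d), so residual additions with H are shape-compatible.
assert attn.shape == (m, d) and ffn.shape == (m, d)
```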

Create the spec by writing:

  1. A step-by-step computation graph (as numbered equations) for the full block from input H to output H_out, including residual connections and LayerNorm placement.
  2. The required matrix dimensions for W_q, W_k, W_v, the output projection W_o, and the FFN matrices W_h and W_f (use d, d_h, n_head, d_k; you may assume d = n_head·d_k).

Your answer must be internally consistent: every addition must be shape-compatible, and your norm placement must match the scheme you chose.
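To illustrate what "internally consistent" means here, the sketch below shows one admissible scheme (pre-norm) as code, with the step numbering in comments. It is only an example, not the required answer: a post-norm spec would instead compute LayerNorm(H + F(H)) at each sub-layer, and the toy layer norm (no learned gain/bias) and the stand-in callables are assumptions made purely for the sketch:

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    # Per-row LayerNorm; learned gain and bias omitted in this sketch.
    mu  = X.mean(-1, keepdims=True)
    var = X.var(-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def block_pre_norm(H, attention, ffn):
    # (1) H1    = H  + Attention(LayerNorm(H))   -- norm inside the residual branch
    H1 = H + attention(layer_norm(H))
    # (2) H_out = H1 + FFN(LayerNorm(H1))
    H_out = H1 + ffn(layer_norm(H1))
    return H_out                                 # shape (m, d), same as the input

# Shape-only usage example with stand-in sub-layers (identity maps).
m, d = 8, 16
H = np.ones((m, d))
out = block_pre_norm(H, attention=lambda X: X, ffn=lambda X: X)
assert out.shape == (m, d)
```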
