Example

Diagram of the Transformer Language Model Forward Pass

This diagram illustrates the sequential forward pass of a Transformer-based language model, breaking down the process into several key stages (a code sketch follows the list):

  1. Input Tokens: The model begins with an input sequence of tokens, denoted as $(x_0, x_1, ..., x_{m-1})$.
  2. Token Embeddings: These tokens are converted into a sequence of numerical embeddings $(e_0, e_1, ..., e_{m-1})$.
  3. Transformer Blocks: The embedding sequence is processed through a stack of $L$ identical Transformer blocks. Each block contains a self-attention sub-layer and a feed-forward network (FFN) sub-layer, and can use either a post-norm or pre-norm architecture.
  4. Final Hidden States: The final ($L$-th) block outputs a sequence of contextualized hidden states, $(h^L_0, h^L_1, ..., h^L_{m-1})$.
  5. Logits: These hidden states are linearly transformed into a sequence of unnormalized scores called logits, $(z_0, z_1, ..., z_{m-1})$.
  6. Conditional Probabilities: A Softmax function converts the logits into a conditional probability distribution at each position, such as $\text{Pr}(x_2 \mid x_0, x_1)$.
  7. Output Tokens: These probabilities are then used to predict the subsequent tokens in the sequence, $(x_1, x_2, ..., x_m)$.
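To make these stages concrete, here is a minimal sketch of this forward pass in PyTorch. The sizes (`VOCAB_SIZE`, `D_MODEL`, `N_LAYERS`, `D_FF`, `MAX_LEN`), the learned positional embeddings, and the pre-norm block layout are illustrative assumptions; the diagram does not fix a particular configuration.

```python
# Minimal sketch of the forward pass described above (assumed toy sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, D_MODEL, N_LAYERS, D_FF, MAX_LEN = 1000, 64, 2, 256, 128  # assumptions

class Block(nn.Module):
    """One Transformer block (pre-norm variant): self-attention + FFN.
    A post-norm block would instead apply LayerNorm after each residual add."""
    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(D_MODEL)
        self.attn = nn.MultiheadAttention(D_MODEL, num_heads=4, batch_first=True)
        self.ln2 = nn.LayerNorm(D_MODEL)
        self.ffn = nn.Sequential(nn.Linear(D_MODEL, D_FF), nn.ReLU(),
                                 nn.Linear(D_FF, D_MODEL))

    def forward(self, h):
        m = h.size(1)
        # Causal mask: position i may only attend to positions 0..i.
        mask = torch.triu(torch.ones(m, m, dtype=torch.bool), diagonal=1)
        a = self.ln1(h)
        h = h + self.attn(a, a, a, attn_mask=mask)[0]  # self-attention sub-layer
        h = h + self.ffn(self.ln2(h))                  # FFN sub-layer
        return h

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB_SIZE, D_MODEL)   # stage 2: token embeddings
        self.pos = nn.Embedding(MAX_LEN, D_MODEL)      # assumed: learned positions
        self.blocks = nn.ModuleList(Block() for _ in range(N_LAYERS))  # stage 3
        self.out = nn.Linear(D_MODEL, VOCAB_SIZE)      # stage 5: hidden states -> logits

    def forward(self, x):                              # stage 1: x is (batch, m) token ids
        m = x.size(1)
        h = self.tok(x) + self.pos(torch.arange(m))    # (e_0, ..., e_{m-1})
        for block in self.blocks:                      # L stacked blocks
            h = block(h)                               # ends as (h^L_0, ..., h^L_{m-1}), stage 4
        z = self.out(h)                                # logits (z_0, ..., z_{m-1})
        return F.softmax(z, dim=-1)                    # stage 6: Pr(x_{i+1} | x_0, ..., x_i)
```

For example, `TinyLM()(torch.randint(0, VOCAB_SIZE, (1, 5)))` returns a `(1, 5, VOCAB_SIZE)` tensor; the distribution at position $i$ is used to predict the next token $x_{i+1}$ (stage 7).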
