Diagram of the Transformer Language Model Forward Pass
This diagram illustrates the sequential forward pass of a Transformer-based language model, breaking down the process into several key stages:
- Input Tokens: The model begins with an input sequence of tokens, denoted as x_0, x_1, ..., x_{m-1}.
- Token Embeddings: These tokens are converted into a sequence of numerical embeddings, e_0, e_1, ..., e_{m-1}.
- Transformer Blocks: The embedding sequence is processed through a stack of L identical Transformer blocks. Each block contains a self-attention sub-layer and a feed-forward network (FFN) sub-layer, and can use either a post-norm or pre-norm architecture.
- Final Hidden States: The final (L-th) block outputs a sequence of contextualized hidden states, h_0, h_1, ..., h_{m-1}.
- Logits: These hidden states are linearly transformed into a sequence of unnormalized scores called logits, one vector over the vocabulary per position.
- Conditional Probabilities: A Softmax function converts the logits into a conditional probability distribution for each position, such as Pr(· | x_0, ..., x_i) at position i.
- Output Tokens: These probabilities are then used to predict the subsequent tokens in the sequence, x_1, x_2, ..., x_m.
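Below is a minimal sketch of these stages in PyTorch. The dimensions, the use of nn.TransformerEncoderLayer with a causal mask as a stand-in for a decoder block, and the pre-norm setting are illustrative assumptions, not details taken from the diagram:

```python
import torch
import torch.nn as nn

# Illustrative sizes only (not from the diagram).
vocab_size, d_model, n_heads, n_layers, seq_len = 1000, 64, 4, 2, 8

# Input tokens x_0 ... x_{m-1}
token_ids = torch.randint(0, vocab_size, (1, seq_len))

# Token embeddings e_0 ... e_{m-1}
embedding = nn.Embedding(vocab_size, d_model)
e = embedding(token_ids)

# Stack of L identical Transformer blocks (self-attention + FFN).
# nn.TransformerEncoderLayer with a causal mask stands in for a decoder block;
# norm_first=True selects the pre-norm variant.
block = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                   batch_first=True, norm_first=True)
blocks = nn.TransformerEncoder(block, num_layers=n_layers)
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

# Final hidden states h_0 ... h_{m-1}
h = blocks(e, mask=causal_mask)

# Logits: one unnormalized score vector over the vocabulary per position
output_proj = nn.Linear(d_model, vocab_size)
logits = output_proj(h)

# Conditional probabilities Pr(· | x_0, ..., x_i) per position
probs = torch.softmax(logits, dim=-1)

# Predicted next token, read from the last position's distribution (greedy choice)
next_token_id = int(probs[0, -1].argmax())
```

Because the causal mask blocks attention to later positions, the distribution at position i is conditioned only on tokens up to i, and the prediction for the next token is read from the last position's distribution.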
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Logits in Transformer Language Models
Final Hidden States in a Transformer Language Model
Next-Token Probability Calculation in Autoregressive Decoders
Diagram of the Decoding Phase
Diagram of the Autoregressive Generation Architectural Flow
A decoder-only language model generates text one token at a time in a step-by-step process. Arrange the following steps in the correct chronological order for generating a single new token, given an initial prompt and any previously generated tokens.
In the step-by-step generation process of a decoder-only language model, consider a hypothetical modification at generation step i. Instead of using the initial prompt combined with all previously generated tokens as input, the model is only given the initial prompt. What is the most likely consequence of this change on the generated text?
Diagnosing a Generation Failure in a Decoder-Only Model
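The two generation-step questions above turn on the fact that, at each step, a decoder-only model is fed the prompt plus all previously generated tokens. A greedy-decoding sketch (assuming a model callable that returns per-position logits) makes that loop explicit:

```python
import torch

def generate(model, prompt_ids, num_new_tokens):
    """Greedy decoding sketch. `model` is assumed to map a (1, length) tensor
    of token ids to (1, length, vocab_size) logits."""
    tokens = list(prompt_ids)
    for _ in range(num_new_tokens):
        input_ids = torch.tensor([tokens])        # prompt + everything generated so far
        logits = model(input_ids)
        next_id = int(logits[0, -1].argmax())     # most probable token at the last position
        tokens.append(next_id)                    # feed it back in at the next step
    return tokens
```

If the loop instead passed only the prompt at every step, each iteration would see the same input, so the new tokens would be conditioned only on the prompt and would not cohere with what has already been generated.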
Learn After
A language model is processing an input sequence of text to predict the most likely next word. Arrange the following key computational stages of its forward pass in the correct chronological order, from initial input to final output.
A developer is debugging a Transformer-based language model and observes a specific issue: for any given input sequence, the model produces a valid probability distribution for the next token, but the predicted token seems to have no contextual relationship with the preceding tokens. For example, after the input 'The dog chased the...', the model assigns a high probability to the word 'airplane'. Which component of the forward pass is most likely failing to perform its function, leading to this loss of context?
Transformer Model Output Anomaly
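The debugging scenario above centers on self-attention, the sub-layer that lets each position draw on earlier tokens. A toy sketch of causal scaled dot-product attention for a single head (dimensions are made up) shows how that mixing happens:

```python
import torch
import torch.nn.functional as F

# Toy causal self-attention for one head; sizes are illustrative only.
seq_len, d = 4, 8
x = torch.randn(seq_len, d)                          # one vector per position
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = q @ k.T / d ** 0.5                          # pairwise attention scores
causal = torch.tril(torch.ones(seq_len, seq_len)).bool()
scores = scores.masked_fill(~causal, float("-inf"))  # hide future positions
weights = F.softmax(scores, dim=-1)                  # each row sums to 1 over visible positions
out = weights @ v                                    # contextualized vector per position
```

If this mixing step failed, each position's hidden state would reflect only its own token, and the predicted next word would bear no relation to the preceding context.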