Diagram of the Transformer Language Model Forward Pass
This diagram illustrates the sequential forward pass of a Transformer-based language model, breaking down the process into several key stages:
- Input Tokens: The model begins with an input sequence of tokens, denoted as x_0, x_1, ..., x_{m-1}.
- Token Embeddings: These tokens are converted into a sequence of numerical embeddings, e_0, e_1, ..., e_{m-1}.
- Transformer Blocks: The embedding sequence is processed through a stack of L identical Transformer blocks. Each block contains a self-attention sub-layer and a feed-forward network (FFN) sub-layer, and can use either a post-norm or pre-norm architecture.
- Final Hidden States: The final (L-th) block outputs a sequence of contextualized hidden states, h_0, h_1, ..., h_{m-1}.
- Logits: These hidden states are linearly transformed into a sequence of unnormalized scores called logits, one vector over the vocabulary per position.
- Conditional Probabilities: A Softmax function converts the logits into a conditional probability distribution for each position, such as Pr(· | x_0, ..., x_i) at position i.
- Output Tokens: These probabilities are then used to predict the subsequent tokens in the sequence, x_1, x_2, ..., x_m.
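Below is a minimal sketch of these stages in PyTorch. The dimensions, the use of nn.TransformerEncoderLayer with a causal mask as a stand-in for a decoder block, and the pre-norm setting are illustrative assumptions, not details taken from the diagram:

```python
import torch
import torch.nn as nn

# Illustrative sizes only (not from the diagram).
vocab_size, d_model, n_heads, n_layers, seq_len = 1000, 64, 4, 2, 8

# Input tokens x_0 ... x_{m-1}
token_ids = torch.randint(0, vocab_size, (1, seq_len))

# Token embeddings e_0 ... e_{m-1}
embedding = nn.Embedding(vocab_size, d_model)
e = embedding(token_ids)

# Stack of L identical Transformer blocks (self-attention + FFN).
# nn.TransformerEncoderLayer with a causal mask stands in for a decoder block;
# norm_first=True selects the pre-norm variant.
block = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                   batch_first=True, norm_first=True)
blocks = nn.TransformerEncoder(block, num_layers=n_layers)
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

# Final hidden states h_0 ... h_{m-1}
h = blocks(e, mask=causal_mask)

# Logits: one unnormalized score vector over the vocabulary per position
output_proj = nn.Linear(d_model, vocab_size)
logits = output_proj(h)

# Conditional probabilities Pr(· | x_0, ..., x_i) per position
probs = torch.softmax(logits, dim=-1)

# Predicted next token, read from the last position's distribution (greedy choice)
next_token_id = int(probs[0, -1].argmax())
```

Because the causal mask blocks attention to later positions, the distribution at position i is conditioned only on tokens up to i, and the prediction for the next token is read from the last position's distribution.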
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Logits in Transformer Language Models
Final Hidden States in a Transformer Language Model
Next-Token Probability Calculation in Autoregressive Decoders
Diagram of the Decoding Phase
Diagram of the Autoregressive Generation Architectural Flow
A decoder-only language model generates text one token at a time in a step-by-step process. Arrange the following steps in the correct chronological order for generating a single new token, given an initial prompt and any previously generated tokens.
In the step-by-step generation process of a decoder-only language model, consider a hypothetical modification at generation step i. Instead of using the initial prompt combined with all previously generated tokens as input, the model is only given the initial prompt. What is the most likely consequence of this change on the generated text?
Diagnosing a Generation Failure in a Decoder-Only Model
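The two generation-step questions above turn on the fact that, at each step, a decoder-only model is fed the prompt plus all previously generated tokens. A greedy-decoding sketch (assuming a model callable that returns per-position logits) makes that loop explicit:

```python
import torch

def generate(model, prompt_ids, num_new_tokens):
    """Greedy decoding sketch. `model` is assumed to map a (1, length) tensor
    of token ids to (1, length, vocab_size) logits."""
    tokens = list(prompt_ids)
    for _ in range(num_new_tokens):
        input_ids = torch.tensor([tokens])        # prompt + everything generated so far
        logits = model(input_ids)
        next_id = int(logits[0, -1].argmax())     # most probable token at the last position
        tokens.append(next_id)                    # feed it back in at the next step
    return tokens
```

If the loop instead passed only the prompt at every step, each iteration would see the same input, so the new tokens would be conditioned only on the prompt and would not cohere with what has already been generated.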
Learn After
A language model is processing an input sequence of text to predict the most likely next word. Arrange the following key computational stages of its forward pass in the correct chronological order, from initial input to final output.
A developer is debugging a Transformer-based language model and observes a specific issue: for any given input sequence, the model produces a valid probability distribution for the next token, but the predicted token seems to have no contextual relationship with the preceding tokens. For example, after the input 'The dog chased the...', the model assigns a high probability to the word 'airplane'. Which component of the forward pass is most likely failing to perform its function, leading to this loss of context?
Transformer Model Output Anomaly
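The debugging scenario above centers on self-attention, the sub-layer that lets each position draw on earlier tokens. A toy sketch of causal scaled dot-product attention for a single head (dimensions are made up) shows how that mixing happens:

```python
import torch
import torch.nn.functional as F

# Toy causal self-attention for one head; sizes are illustrative only.
seq_len, d = 4, 8
x = torch.randn(seq_len, d)                          # one vector per position
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = q @ k.T / d ** 0.5                          # pairwise attention scores
causal = torch.tril(torch.ones(seq_len, seq_len)).bool()
scores = scores.masked_fill(~causal, float("-inf"))  # hide future positions
weights = F.softmax(scores, dim=-1)                  # each row sums to 1 over visible positions
out = weights @ v                                    # contextualized vector per position
```

If this mixing step failed, each position's hidden state would reflect only its own token, and the predicted next word would bear no relation to the preceding context.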