Learn Before
Processing Flow of Autoregressive Generation in a Decoder-Only Transformer
The process of autoregressive generation in a decoder-only Transformer involves a step-by-step architectural flow. At each generation step i, the input sequence is formed by concatenating the initial prompt x with all previously generated tokens y_{<i}. This combined sequence is first converted into embeddings by an embedding layer. The embeddings are then processed through a stack of L decoder layers, each containing self-attention and feed-forward network (FFN) modules. The output from the final decoder layer undergoes a linear mapping and is then passed to a Softmax layer. This produces the conditional probability distribution, Pr(·|x, y_{<i}), which is used to select the next token y_i.
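Below is a minimal sketch of this flow, assuming a PyTorch-style implementation. The class name TinyDecoderOnlyLM, the sizes, and the greedy selection are illustrative assumptions, not details from the card; the point is only to show the embed → L decoder layers → linear → Softmax → select-next-token loop.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
VOCAB_SIZE, D_MODEL, N_LAYERS, N_HEADS = 1000, 64, 2, 4

class TinyDecoderOnlyLM(nn.Module):
    def __init__(self):
        super().__init__()
        # Embedding layer: token ids -> vectors.
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        # Stack of L layers, each containing self-attention and an FFN.
        # (nn.TransformerEncoderLayer has exactly these two sub-layers;
        #  a causal mask makes it behave as a decoder-only block.)
        layer = nn.TransformerEncoderLayer(
            D_MODEL, N_HEADS, dim_feedforward=128, batch_first=True
        )
        self.layers = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        # Linear mapping from final hidden states to vocabulary logits.
        self.out_proj = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, token_ids):
        seq_len = token_ids.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.embed(token_ids)              # embeddings
        h = self.layers(h, mask=causal_mask)   # L decoder layers
        return self.out_proj(h)                # logits, shape (batch, seq, vocab)

@torch.no_grad()
def generate(model, prompt_ids, n_new_tokens):
    seq = prompt_ids                                        # x
    for _ in range(n_new_tokens):                           # generation step i
        logits = model(seq)                                 # forward pass over x and y_{<i}
        probs = torch.softmax(logits[:, -1, :], dim=-1)     # Pr(. | x, y_{<i})
        next_token = probs.argmax(dim=-1, keepdim=True)     # select y_i (greedy, as an example)
        seq = torch.cat([seq, next_token], dim=1)           # append y_i and repeat
    return seq

model = TinyDecoderOnlyLM()
prompt = torch.randint(0, VOCAB_SIZE, (1, 5))  # stand-in prompt token ids
print(generate(model, prompt, n_new_tokens=3))
```

In this sketch the input at each step is the concatenation of the prompt with everything generated so far, and the distribution over the next token is read from the last position's logits after the Softmax.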
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.5 Inference - Foundations of Large Language Models
Related
Training Decoder-Only Language Models with Cross-Entropy Loss
Output Probability Calculation in Transformer Language Models
Global Nature of Standard Transformer LLMs
Processing Flow of Autoregressive Generation in a Decoder-Only Transformer
Initial Input Representation for Transformer Layers
Greedy Decoding in Language Models
Structure of a Transformer Block
A generative language model is designed to produce text by predicting the next token based solely on the sequence of tokens that came before it. If you were to adapt a standard Transformer decoder block for this specific auto-regressive task, which of its sub-layers would you remove, and why is this modification functionally necessary?
A language model is constructed using a stack of modified Transformer decoder blocks. Each block contains a self-attention sub-layer and a feed-forward network sub-layer, but lacks the sub-layer that would process information from a separate, secondary input sequence. This model is capable of performing a machine translation task, such as translating a German sentence into English, without any further architectural changes.
Function of Self-Attention in Auto-regressive Generation
Neural Network-Based Next-Token Probability Distribution
Learn After
Logits in Transformer Language Models
Final Hidden States in a Transformer Language Model
Next-Token Probability Calculation in Autoregressive Decoders
Diagram of the Decoding Phase
Diagram of the Transformer Language Model Forward Pass
Diagram of the Autoregressive Generation Architectural Flow
A decoder-only language model generates text one token at a time in a step-by-step process. Arrange the following steps in the correct chronological order for generating a single new token, given an initial prompt and any previously generated tokens.
In the step-by-step generation process of a decoder-only language model, consider a hypothetical modification at generation step i. Instead of using the initial prompt combined with all previously generated tokens as input, the model is only given the initial prompt. What is the most likely consequence of this change on the generated text?
Diagnosing a Generation Failure in a Decoder-Only Model