Learn Before
Greedy Decoding in Language Models
In language model inference, a common method for generating text is greedy decoding: at each position, select the token with the maximum probability from the model's predicted next-token distribution. The strategy is applied sequentially, so the model's output at each step is the single most likely next token given the preceding sequence, and that token is appended to the context before the next prediction is made.
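A minimal sketch of this loop in Python is shown below. The `greedy_decode` function, the `step_probs` interface, and the `toy_model` example are illustrative assumptions, not part of any particular library; any model that maps a token sequence to a next-token distribution would fit.

```python
from typing import Callable, List, Sequence

def greedy_decode(
    step_probs: Callable[[Sequence[int]], Sequence[float]],
    prompt: List[int],
    max_new_tokens: int,
    eos_id: int,
) -> List[int]:
    """Greedily extend `prompt` by repeatedly taking the argmax token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = step_probs(tokens)  # next-token distribution given the context
        # Pick the single most likely next token given the preceding sequence.
        next_id = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_id)      # the choice becomes part of the context
        if next_id == eos_id:       # stop at end-of-sequence
            break
    return tokens

# Toy usage with a hard-coded 4-token vocabulary {0: 'A', 1: 'B', 2: 'C', 3: <eos>}:
def toy_model(context: Sequence[int]) -> List[float]:
    return [0.1, 0.2, 0.6, 0.1] if len(context) < 3 else [0.05, 0.05, 0.1, 0.8]

print(greedy_decode(toy_model, prompt=[0], max_new_tokens=10, eos_id=3))
# -> [0, 2, 2, 3]
```

Because the argmax at each step is deterministic, this procedure always produces the same continuation for a given prompt.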
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Training Decoder-Only Language Models with Cross-Entropy Loss
Output Probability Calculation in Transformer Language Models
Global Nature of Standard Transformer LLMs
Processing Flow of Autoregressive Generation in a Decoder-Only Transformer
Initial Input Representation for Transformer Layers
Structure of a Transformer Block
A generative language model is designed to produce text by predicting the next token based solely on the sequence of tokens that came before it. If you were to adapt a standard Transformer decoder block for this specific auto-regressive task, which of its sub-layers would you remove, and why is this modification functionally necessary?
A language model is constructed using a stack of modified Transformer decoder blocks. Each block contains a self-attention sub-layer and a feed-forward network sub-layer, but lacks the sub-layer that would process information from a separate, secondary input sequence. This model is capable of performing a machine translation task, such as translating a German sentence into English, without any further architectural changes.
Function of Self-Attention in Auto-regressive Generation
Neural Network-Based Next-Token Probability Distribution
Learn After
A language model is generating a two-token sequence. At the first step, it assigns a probability of 0.5 to token 'A' and 0.4 to token 'B'. At the second step, if 'A' was chosen, the model assigns a probability of 0.5 to token 'C'. If 'B' was chosen, it assigns a probability of 0.9 to token 'D'. All other tokens have lower probabilities at each step. Based on this information, which statement accurately analyzes the outcome of a purely sequential, maximum-probability selection strategy? (A worked computation for this scenario appears after the list below.)
Evaluating a Text Generation Strategy
Applying a Text Generation Strategy
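To make the arithmetic in the two-token scenario above concrete, here is a short worked computation using only the probabilities stated in the question (the token names 'A' through 'D' are taken from the question itself):

```python
# Joint probabilities of the two candidate sequences from the scenario above.
greedy_path = 0.5 * 0.5  # choose 'A' (0.5), then 'C' (0.5) -> 0.25
alt_path = 0.4 * 0.9     # choose 'B' (0.4), then 'D' (0.9) -> 0.36

# Greedy decoding commits to 'A' at step 1 because 0.5 > 0.4, yet the
# sequence 'B', 'D' has the higher overall probability (0.36 > 0.25).
assert alt_path > greedy_path
print(f"greedy sequence  P(A, C) = {greedy_path:.2f}")
print(f"better sequence  P(B, D) = {alt_path:.2f}")
```

Because greedy selection maximizes each step in isolation, it can miss the sequence with the highest joint probability, which is the trade-off this exercise is probing.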