Learn Before
Function of Self-Attention in Auto-regressive Generation
A language model is built using a stack of modified Transformer decoder blocks. In these blocks, the sub-layer responsible for attending to a separate input sequence has been removed, leaving only the self-attention and feed-forward network sub-layers. Explain the specific role of the self-attention mechanism in enabling this model to perform its primary function: generating a new token based solely on the sequence of tokens that came before it.
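The mechanism the prompt asks about can be illustrated with a minimal sketch of single-head causal self-attention in plain Python. It is not a reference implementation. For brevity it assumes identity projections (queries, keys, and values are all the raw input vectors), whereas a real model would use learned projection matrices. The names `causal_self_attention` and `tokens` are hypothetical and chosen only for this illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(x):
    """Single-head causal self-attention over a sequence of token vectors.

    x: list of token vectors (lists of floats) in sequence order.
    Assumption for illustration: Q = K = V = x (no learned projections).
    Returns (outputs, weights), where weights[i][j] is how much position i
    attends to position j.
    """
    d = len(x[0])
    outputs, weights = [], []
    for i, query in enumerate(x):
        # Causal mask: position i may only see positions 0..i.
        visible = x[: i + 1]
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in visible
        ]
        w = softmax(scores)
        # Weighted sum of the visible value vectors.
        out = [
            sum(w[j] * visible[j][c] for j in range(len(visible)))
            for c in range(d)
        ]
        # Masked (future) positions receive exactly zero weight.
        weights.append(w + [0.0] * (len(x) - i - 1))
        outputs.append(out)
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(tokens)

# Replacing the last token leaves the earlier positions' outputs unchanged.
changed = tokens[:2] + [[5.0, -3.0]]
outputs_changed, _ = causal_self_attention(changed)
print(outputs[:2] == outputs_changed[:2])
```

In this sketch the first position can attend only to itself, so its weight row is `[1.0, 0.0, 0.0]`, and the representation at the last position is a mixture of every token that precedes it plus itself. That final-position representation is what a decoder-only model would pass on to predict the next token.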
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.2 Generative Models - Foundations of Large Language Models
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Training Decoder-Only Language Models with Cross-Entropy Loss
Output Probability Calculation in Transformer Language Models
Global Nature of Standard Transformer LLMs
Processing Flow of Autoregressive Generation in a Decoder-Only Transformer
Initial Input Representation for Transformer Layers
Greedy Decoding in Language Models
Structure of a Transformer Block
A generative language model is designed to produce text by predicting the next token based solely on the sequence of tokens that came before it. If you were to adapt a standard Transformer decoder block for this specific auto-regressive task, which of its sub-layers would you remove, and why is this modification functionally necessary?
A language model is constructed using a stack of modified Transformer decoder blocks. Each block contains a self-attention sub-layer and a feed-forward network sub-layer, but lacks the sub-layer that would process information from a separate, secondary input sequence. This model is capable of performing a machine translation task, such as translating a German sentence into English, without any further architectural changes.
Neural Network-Based Next-Token Probability Distribution