Learn Before
Structure of a Transformer Block
The core component of a Transformer model is the Transformer block, also referred to as a layer. Each block consists of two main sub-layers stacked sequentially: a self-attention sub-layer, which models the relationships between tokens in the sequence, and a feed-forward network (FFN) sub-layer, which applies a position-wise transformation to each token independently. Each sub-layer is wrapped with a residual connection and layer normalization, and these components can be arranged under different normalization schemes, such as the post-norm architecture, in which normalization is applied after the residual addition.
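As a concrete illustration, the following is a minimal PyTorch sketch of a post-norm block. The specific dimensions (d_model=512, n_heads=8, d_ff=2048) and the ReLU activation are assumptions chosen for the example, not values fixed by the course material.

import torch
import torch.nn as nn

class PostNormTransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # Self-attention sub-layer: lets each token attend to other tokens.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Feed-forward sub-layer: position-wise transformation of each token.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Post-norm: LayerNorm is applied *after* each residual addition.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ffn(x))
        return x

# Usage: a batch of 2 sequences of 10 token representations, width 512.
h = torch.randn(2, 10, 512)
print(PostNormTransformerBlock()(h).shape)  # torch.Size([2, 10, 512])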
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Training Decoder-Only Language Models with Cross-Entropy Loss
Output Probability Calculation in Transformer Language Models
Global Nature of Standard Transformer LLMs
Processing Flow of Autoregressive Generation in a Decoder-Only Transformer
Initial Input Representation for Transformer Layers
Greedy Decoding in Language Models
A generative language model is designed to produce text by predicting the next token based solely on the sequence of tokens that came before it. If you were to adapt a standard Transformer decoder block for this specific auto-regressive task, which of its sub-layers would you remove, and why is this modification functionally necessary?
A language model is constructed using a stack of modified Transformer decoder blocks. Each block contains a self-attention sub-layer and a feed-forward network sub-layer, but lacks the sub-layer that would process information from a separate, secondary input sequence. This model is capable of performing a machine translation task, such as translating a German sentence into English, without any further architectural changes.
Function of Self-Attention in Auto-regressive Generation
Neural Network-Based Next-Token Probability Distribution
Learn After
Formula for Post-Normalization in a Transformer Sub-layer
A standard Transformer block processes an input sequence through two main sub-layers using a post-normalization scheme. Arrange the following operations in the correct order from start to finish for a single block.
A language model built with Transformer blocks consistently produces grammatically correct sentences, but the sentences lack contextual coherence. For instance, given the input 'The scientist carefully placed the sample under the microscope to observe its...', the model generates '...color is a vibrant shade of the car.' Which sub-layer within the Transformer blocks is most likely failing to perform its primary function, leading to this specific type of error?
Component Roles in a Transformer Block
Transformer Block Inputs and Outputs Notation