Initial Input Representation for Transformer Layers
In a decoder-only Transformer model, the sequence of input tokens is represented by a sequence of d-dimensional vectors, denoted e_0, ..., e_{m-1}. For a given position i, the vector e_i is computed as the sum of the token embedding x_i for the specific token and its corresponding positional embedding PE(i), i.e., e_i = x_i + PE(i). This final sequence of vectors forms the initial input that is fed into the stack of Transformer blocks.
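A minimal NumPy sketch of this lookup-and-sum step, assuming learned absolute positional embeddings; the table sizes and all names (VOCAB_SIZE, D_MODEL, initial_input) are illustrative assumptions, not details from the course:

```python
import numpy as np

# Hypothetical sizes: a 50,000-token vocabulary, model width d = 512,
# and a maximum context length of 1,024 positions.
VOCAB_SIZE, D_MODEL, MAX_LEN = 50_000, 512, 1_024

rng = np.random.default_rng(0)
token_embedding = rng.normal(size=(VOCAB_SIZE, D_MODEL))    # one row per token id
positional_embedding = rng.normal(size=(MAX_LEN, D_MODEL))  # one row per position

def initial_input(token_ids: list[int]) -> np.ndarray:
    """Return the (sequence_length, d) matrix fed into the first Transformer block.

    For each position i, e_i = x_i + PE(i): the embedding of the token at
    position i plus the embedding of position i itself.
    """
    positions = np.arange(len(token_ids))
    return token_embedding[token_ids] + positional_embedding[positions]

# Example: a 4-token input yields four d-dimensional vectors.
e = initial_input([17, 933, 4_021, 8])
print(e.shape)  # (4, 512)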
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.5 Inference - Foundations of Large Language Models
Related
Training Decoder-Only Language Models with Cross-Entropy Loss
Output Probability Calculation in Transformer Language Models
Global Nature of Standard Transformer LLMs
Processing Flow of Autoregressive Generation in a Decoder-Only Transformer
Initial Input Representation for Transformer Layers
Greedy Decoding in Language Models
Structure of a Transformer Block
A generative language model is designed to produce text by predicting the next token based solely on the sequence of tokens that came before it. If you were to adapt a standard Transformer decoder block for this specific auto-regressive task, which of its sub-layers would you remove, and why is this modification functionally necessary?
A language model is constructed using a stack of modified Transformer decoder blocks. Each block contains a self-attention sub-layer and a feed-forward network sub-layer, but lacks the sub-layer that would process information from a separate, secondary input sequence. This model is capable of performing a machine translation task, such as translating a German sentence into English, without any further architectural changes.
Function of Self-Attention in Auto-regressive Generation
Neural Network-Based Next-Token Probability Distribution
Self-Attention layer understanding - Step 5 - Adding the time
Input Embedding with Positional Encoding
Learnable Absolute Positional Embeddings
Comparison of Arbitrary Order Prediction and Masked Language Modeling
An engineer builds a language model where all input words in a sentence are processed simultaneously and independently before their information is combined. When testing the model with the sentences 'The cat chased the dog' and 'The dog chased the cat', the engineer observes that the model generates identical internal representations for both, failing to capture their different meanings. Which of the following modifications would most directly address this fundamental flaw?
Model Architecture Design Choice
Analyzing Order-Insensitivity in Language Models
Learn After
Layer-wise Processing in Transformer Inference
Initial Representation for Concatenated [x, y] Sequences
Calculating an Initial Input Vector
A decoder-only model is preparing the input sequence 'The quick brown fox' for processing. To create the initial input representation for the token 'brown' (at position 2), the model retrieves its token embedding vector, V_brown, and the positional embedding vector for position 2, P_2. Which of the following correctly describes the operation used to combine these two vectors into the final representation that is fed into the first layer of the model?
A decoder-only Transformer model is given a sequence of tokens as input. Arrange the following steps in the correct chronological order to describe how the model creates the initial representation that is fed into its first layer.
Input Representation for a Single Token in Autoregressive Generation