Final Hidden States in a Transformer Language Model
In a Transformer-based language model with L layers, the final hidden states are the sequence of output vectors from the last Transformer block, denoted h_1, h_2, ..., h_n for an input of n tokens. Each vector h_i represents the contextualized embedding of the i-th token after processing through the entire stack of L layers. This sequence of vectors encapsulates the model's final understanding of the input sequence and serves as the basis for subsequent predictions, such as generating the logits for the next token.
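The path from tokens to final hidden states to logits can be made concrete in code. Below is a minimal sketch, assuming the Hugging Face transformers library and the GPT-2 checkpoint (an illustrative choice, not specified by this note); output_hidden_states=True exposes the per-layer vectors, and the last entry of the returned tuple holds the final hidden states described above.

```python
# Minimal sketch: extracting final hidden states and projecting them to logits.
# Assumes the Hugging Face transformers library and GPT-2 (illustrative choices).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The river bank was steep", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple: (embedding output, layer 1, ..., layer L);
# the last element holds the final hidden states h_1 ... h_n.
final_hidden = out.hidden_states[-1]     # shape: (batch, n, hidden_dim)

# The LM head projects each final hidden state onto the vocabulary.
logits = model.lm_head(final_hidden)     # shape: (batch, n, vocab_size)
assert torch.allclose(logits, out.logits, atol=1e-4)
```

The logits at the last position are what the model uses to predict the next token.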

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Logits in Transformer Language Models
Next-Token Probability Calculation in Autoregressive Decoders
Diagram of the Decoding Phase
Diagram of the Transformer Language Model Forward Pass
Diagram of the Autoregressive Generation Architectural Flow
A decoder-only language model generates text one token at a time in a step-by-step process. Arrange the following steps in the correct chronological order for generating a single new token, given an initial prompt and any previously generated tokens.
Diagnosing a Generation Failure in a Decoder-Only Model
In the step-by-step generation process of a decoder-only language model, consider a hypothetical modification at generation step i: instead of using the initial prompt combined with all previously generated tokens as input, the model is only given the initial prompt. What is the most likely consequence of this change on the generated text?
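Both generation questions above turn on the autoregressive input loop. The sketch below (assuming, as above, the Hugging Face transformers library and GPT-2) makes the data flow explicit: at every step the model's input is the prompt concatenated with all previously generated tokens, and the next token is read off the logits at the last position.

```python
# Minimal sketch of greedy autoregressive decoding; library and model are
# illustrative assumptions, not specified by these cards.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The river bank", return_tensors="pt").input_ids
for _ in range(10):
    with torch.no_grad():
        logits = model(ids).logits           # (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()         # greedy choice at the LAST position
    # Append the new token: the next iteration sees prompt + all prior outputs.
    # (Feeding only the prompt each step would repeat the same first token.)
    ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(ids[0]))
```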
Learn After
Logits in Transformer Language Models
A language model processes the following two sentences independently:
- 'The river bank was steep and muddy.'
- 'He withdrew cash from the bank.'
Considering the final layer of the model, how would the output vector (the final hidden state) for the word 'bank' in the first sentence compare to the output vector for 'bank' in the second sentence?
A language model with multiple layers processes an input sequence to predict the next token. For a single token within that sequence, arrange the following representations in the chronological order they are computed by the model.
A machine learning engineer is building a system to classify the sentiment of customer reviews (e.g., positive, negative). They decide to use the internal representations from a pre-trained, multi-layered language model as features for their classifier. Which of the following model outputs would provide the most contextually-rich and effective representation of an entire review for this classification task?
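The 'bank' question and the feature-extraction question can both be checked empirically: the final hidden state of a token depends on its context. A minimal sketch, again assuming the Hugging Face transformers library and GPT-2 (the helper name bank_vector is hypothetical):

```python
# Compare the final hidden state of 'bank' in two different contexts.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

def bank_vector(sentence: str) -> torch.Tensor:
    enc = tokenizer(sentence, return_tensors="pt")
    # GPT-2's BPE encodes ' bank' (with leading space) as a single token.
    pos = enc.input_ids[0].tolist().index(tokenizer.encode(" bank")[0])
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    return out.hidden_states[-1][0, pos]     # final hidden state of 'bank'

v1 = bank_vector("The river bank was steep and muddy.")
v2 = bank_vector("He withdrew cash from the bank.")
# Cosine similarity well below 1.0: the vectors are context-dependent.
print(torch.cosine_similarity(v1, v2, dim=0).item())
```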