Logits in Transformer Language Models
In Transformer-based language models, logits are the raw, unnormalized scores output by the model's final linear layer, before a Softmax function is applied. They are represented as a sequence of vectors, z_0, ..., z_{m-1}, where each vector corresponds to a token position in the sequence. Each vector z_i is generated by projecting the final hidden state h_i into the vocabulary space (for example, z_i = h_i W_o, where W_o is the output projection matrix), with each element of z_i representing the score for one token in the vocabulary.
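This projection can be sketched in pure Python (the dimensions, weights, and function names here are illustrative, not any particular model's implementation): a final hidden state of length d is multiplied by a d x V output matrix to produce V raw scores, which Softmax then normalizes into probabilities.

```python
import math
import random

def logits_from_hidden(h, W_o):
    """Project a final hidden state h (length d) into vocabulary space.

    W_o is a d x V output projection matrix; the result is a vector of
    V raw, unnormalized scores (logits), one per vocabulary token.
    """
    d, V = len(W_o), len(W_o[0])
    return [sum(h[i] * W_o[i][j] for i in range(d)) for j in range(V)]

def softmax(z):
    """Normalize a vector of logits into a probability distribution."""
    m = max(z)                        # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    s = sum(exps)
    return [e / s for e in exps]

# Toy example: d = 4 hidden dimensions, V = 5 vocabulary tokens.
random.seed(0)
d, V = 4, 5
h = [random.gauss(0, 1) for _ in range(d)]
W_o = [[random.gauss(0, 1) for _ in range(V)] for _ in range(d)]

z = logits_from_hidden(h, W_o)  # raw scores: any real numbers
p = softmax(z)                  # probabilities: non-negative, sum to 1
```

Note that the logits themselves can be any real numbers; only after Softmax do they become a valid probability distribution over the vocabulary.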

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
A language model processes the following two sentences independently:
- 'The river bank was steep and muddy.'
- 'He withdrew cash from the bank.'
Considering the final layer of the model, how would the output vector (the final hidden state) for the word 'bank' in the first sentence compare to the output vector for 'bank' in the second sentence?
A language model with multiple layers processes an input sequence to predict the next token. For a single token within that sequence, arrange the following representations in the chronological order they are computed by the model.
A machine learning engineer is building a system to classify the sentiment of customer reviews (e.g., positive, negative). They decide to use the internal representations from a pre-trained, multi-layered language model as features for their classifier. Which of the following model outputs would provide the most contextually-rich and effective representation of an entire review for this classification task?
Logits in Transformer Language Models
Final Hidden States in a Transformer Language Model
Next-Token Probability Calculation in Autoregressive Decoders
Diagram of the Decoding Phase
Diagram of the Transformer Language Model Forward Pass
Diagram of the Autoregressive Generation Architectural Flow
A decoder-only language model generates text one token at a time in a step-by-step process. Arrange the following steps in the correct chronological order for generating a single new token, given an initial prompt and any previously generated tokens.
In the step-by-step generation process of a decoder-only language model, consider a hypothetical modification at generation step i: instead of using the initial prompt combined with all previously generated tokens as input, the model is only given the initial prompt. What is the most likely consequence of this change on the generated text?
Diagnosing a Generation Failure in a Decoder-Only Model
Learn After
Output Probability Calculation in Transformer Language Models
A language model is tasked with predicting the next word for the sequence 'The cat sat on the'. After processing this input, the model's final linear layer produces a vector with 50,257 raw numerical scores, one for each word in its vocabulary. Which statement best characterizes this vector of raw scores, just before any final normalization function (like Softmax) is applied?
A language model has produced a vector of raw, unnormalized scores for all possible next words in its vocabulary. If a data scientist adds a constant value of 10 to every single score in this vector, the final probability assigned to each word will be unchanged, because Softmax is invariant to adding the same constant to every logit.
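This shift invariance follows from exp(z_i + c) / sum_j exp(z_j + c) = exp(z_i) / sum_j exp(z_j), since the factor exp(c) cancels. A quick numerical check (a minimal pure-Python sketch; the toy logit values are illustrative):

```python
import math

def softmax(z):
    """Convert raw logits into a probability distribution."""
    m = max(z)                        # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, -1.0, 0.5, 3.0]       # toy raw scores
shifted = [x + 10 for x in logits]   # add the same constant to every score

p1 = softmax(logits)
p2 = softmax(shifted)

# The two distributions are identical: the exp(10) factor cancels
# between the numerator and the denominator of the Softmax.
assert all(abs(a - b) < 1e-12 for a, b in zip(p1, p2))
```

The same cancellation is why practical Softmax implementations subtract the maximum logit before exponentiating: it changes nothing mathematically but prevents overflow.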
Interpreting Model Output Scores