Next-Token Probability Calculation in a Transformer Decoder
In a standard Transformer decoder architecture, the probability distribution for the next token is computed in two steps. First, the decoder model ($\mathrm{Dec}$) processes the concatenation of the input sequence $x_1 \dots x_m$ and the previously generated output tokens $y_1 \dots y_{i-1}$ to produce a final sequence of representations $\mathbf{h}_1 \dots \mathbf{h}_{m+i-1}$. Second, the representation at the current position is multiplied by an output projection matrix $\mathbf{W}_o$ and passed through a Softmax function to yield the probability distribution over the next token. The formulas are:

$$\mathbf{h}_1 \dots \mathbf{h}_{m+i-1} = \mathrm{Dec}(x_1 \dots x_m \, y_1 \dots y_{i-1})$$

$$\Pr(\cdot \mid x_1 \dots x_m, y_1 \dots y_{i-1}) = \mathrm{Softmax}(\mathbf{h}_{m+i-1} \mathbf{W}_o)$$

The subscript $m+i-1$ indicates that the calculation is performed at the current decoding step, after processing the $m$ input tokens and the $i-1$ previously generated output tokens.
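To make the two steps concrete, here is a minimal PyTorch sketch. It is an illustration under assumed names and sizes: `d_model`, `vocab_size`, and the random `hidden_states` stand in for a real decoder's output, and `W_o` plays the role of the output projection from the formulas above.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes (assumptions, not from the text above).
d_model, vocab_size = 512, 32_000
m, i = 6, 3  # m input tokens; i-1 tokens generated so far

# Output projection matrix W_o: maps a d_model-dim hidden state to vocabulary logits.
W_o = torch.randn(d_model, vocab_size)

# Step 1: Dec(x_1 .. x_m  y_1 .. y_{i-1}) -> h_1 .. h_{m+i-1}.
# A real model would run its decoder stack here; random hidden states
# of the right shape stand in for the sketch.
hidden_states = torch.randn(m + i - 1, d_model)  # rows are h_1 .. h_{m+i-1}

# Step 2: project the current representation h_{m+i-1} and apply Softmax.
h_current = hidden_states[-1]       # h_{m+i-1}
logits = h_current @ W_o            # shape: (vocab_size,)
probs = F.softmax(logits, dim=-1)   # Pr(. | x_1..x_m, y_1..y_{i-1})

# probs is a distribution over the vocabulary; sampling from it (or taking
# its argmax) selects the next token y_i.
next_token = torch.argmax(probs).item()
```

In practice, implementations often fold the projection and Softmax into a single output layer, and some models tie $\mathbf{W}_o$ to the input embedding matrix, but the order hidden state, then projection, then Softmax is the same.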

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
The Search Problem in LLM Inference
Next-Token Probability Calculation in a Transformer Decoder
In an autoregressive language model, after processing a sequence of input tokens, a corresponding sequence of hidden state vectors is produced by the final decoder layer. To predict the probability distribution for the single token that will come next, what is the correct procedure and why?
An autoregressive model generates text one token at a time. Arrange the following computational steps in the correct order to calculate the probability distribution for the very next token, given the current sequence of tokens.
Debugging a Language Model's Output Distribution
Layer-wise Processing in Transformer Inference
Formula for KV Cache Prefilling
A researcher is building a sequence processing model and describes one of its core layers: the layer first applies a self-attention mechanism to its input sequence and then applies the same two-layer neural network independently at each position. Based on this description, which statement accurately identifies a potential flaw or misunderstanding in the researcher's design compared to a standard Transformer decoding network layer?
A single token's data is being processed by a standard Transformer decoding network. Arrange the following operations in the correct sequence as the data flows through the network's core components, starting from the initial input.
Diagnosing a Faulty Decoding Network
Match each core component of a Transformer decoding network to its primary function within the network's architecture.
Learn After
A developer is implementing a text-generation model. During the decoding process for each new token, their model first computes a final hidden state vector from the decoder. They then immediately apply a Softmax function to this hidden state vector to get a probability distribution for the next token. Which statement best analyzes the flaw in this approach?
A Transformer-based language model has a final hidden state dimension of 768 for each token position. The model's vocabulary consists of 50,000 unique tokens. To compute the probability distribution for the next token, the final hidden state vector is multiplied by an output projection matrix before the Softmax function is applied. What must be the dimensions of this output projection matrix? (See the worked sketch after these questions.)
A Transformer decoder is generating the next token in a sequence. Arrange the following computational steps in the correct order to produce the final probability distribution over the vocabulary.
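As a worked check for the projection-matrix question above: with the row-vector convention $\mathbf{h}\mathbf{W}_o$ used in the formulas at the top, $\mathbf{W}_o$ must map a 768-dimensional hidden state to 50,000 logits, so its shape is $768 \times 50{,}000$ (with the column-vector convention $\mathbf{W}_o\mathbf{h}$ it would be the transpose, $50{,}000 \times 768$). A minimal sketch:

```python
import torch

h = torch.randn(768)             # final hidden state at the current position
W_o = torch.randn(768, 50_000)   # output projection: d_model x vocab_size
logits = h @ W_o                 # one logit per vocabulary token
probs = torch.softmax(logits, dim=-1)

assert probs.shape == (50_000,)  # a distribution over the 50,000-token vocabulary
```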