Formula

Next-Token Probability Calculation in Autoregressive Decoders

In an autoregressive decoder, the probability of the next token is computed at each step i by conditioning on the input x and all previously generated tokens y_{<i}. The input x and the prefix y_{<i} are concatenated and passed through an embedding layer. The resulting sequence of embeddings is processed by a stack of decoder layers (each typically combining self-attention and a feed-forward network) to produce a sequence of hidden states H. A final linear transformation with the output weight matrix W^o maps these hidden states to logits, and a Softmax turns the logits into probability distributions over the vocabulary. The distribution for the next token, y_i, is the probability vector at the final position of the output sequence:

\Pr(\cdot \mid x, y_{<i}) = \big(\text{Softmax}(H W^o)\big)_{\text{last}}, \quad \text{where} \quad H = \text{Decoder}([x, y_{<i}])
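
A minimal sketch of this computation in PyTorch may help make the shapes concrete. It uses nn.TransformerEncoder with a causal mask as a stand-in for the decoder stack; the layer sizes, token ids, and variable names below are illustrative assumptions, not taken from the book.

```python
import torch
import torch.nn as nn

# Illustrative sketch, assuming PyTorch. nn.TransformerEncoder with a
# causal mask plays the role of the decoder-only stack; all sizes and
# token ids are made up for the example.
vocab_size, d_model = 100, 32

embed = nn.Embedding(vocab_size, d_model)             # embedding layer
block = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
decoder = nn.TransformerEncoder(block, num_layers=2)  # stack of decoder layers
W_o = nn.Linear(d_model, vocab_size, bias=False)      # output matrix W^o

x = torch.tensor([[5, 17, 42]])      # input tokens x (batch of 1)
y_prev = torch.tensor([[7, 3]])      # previously generated tokens y_{<i}

seq = torch.cat([x, y_prev], dim=1)  # [x, y_{<i}]
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))

H = decoder(embed(seq), mask=causal_mask)   # hidden states H
logits = W_o(H)                             # H W^o
probs = torch.softmax(logits, dim=-1)       # Softmax over the vocabulary
next_token_dist = probs[:, -1, :]           # last position: Pr(·|x, y_{<i})

print(next_token_dist.shape)  # torch.Size([1, 100]); each row sums to 1
```

Taking the last position is what makes the step autoregressive: only the final hidden state summarizes the full prefix [x, y_{<i}], so only its probability vector is used to sample or select y_i.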




Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
