Formula

Next-Token Probability Calculation in a Transformer Decoder

In a standard Transformer decoder architecture, the probability distribution for the next token is computed in two steps. First, the decoder model (\mathrm{Dec}) processes the concatenation of the input sequence \mathbf{x} and the previously generated output tokens \mathbf{y}_{<i} to produce a final sequence of representations \mathbf{H}. Second, this representation is multiplied by an output projection matrix \mathbf{W}^o and passed through a Softmax function to yield the probability distribution over the next token. The formulas are:

\mathbf{H} = \mathrm{Dec}([\mathbf{x}, \mathbf{y}_{<i}])

\Pr(\cdot|\mathbf{x}, \mathbf{y}_{<i}) = \mathrm{Softmax}(\mathbf{H}\mathbf{W}^o)_{m+i}

The subscript m+i indicates that the calculation is performed at the current decoding step, after processing the m input tokens and the i-1 previously generated output tokens.
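To make the two steps concrete, here is a minimal NumPy sketch. The decoder itself is stubbed out with random hidden states, and all sizes (m, d, V) and names (H, W_o) are illustrative assumptions rather than values from the text; only the projection and Softmax steps follow the formulas above.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: m = 4 input tokens, i - 1 = 2 generated tokens,
# hidden size d = 8, vocabulary size V = 50.
m, i_minus_1, d, V = 4, 2, 8, 50
rng = np.random.default_rng(0)

# Stand-in for H = Dec([x, y_{<i}]): one d-dimensional row per position.
# A real decoder would compute this; here we fake it with random numbers.
H = rng.normal(size=(m + i_minus_1, d))

# Output projection W^o maps hidden states to vocabulary logits.
W_o = rng.normal(size=(d, V))

# Softmax(H W^o) gives a distribution over the vocabulary at every position;
# the row for the current decoding step (the last position of the
# concatenated sequence) is Pr(. | x, y_{<i}).
probs = softmax(H @ W_o)        # one row per position, each summing to 1
next_token_dist = probs[-1]     # the row the formula's subscript selects

assert np.isclose(next_token_dist.sum(), 1.0)
print(next_token_dist.argmax()) # e.g., a greedy choice of the next token
```

This final row is the distribution a decoding strategy consumes: greedy search takes its argmax, while sampling-based methods draw the next token from it.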


Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
