Next-Token Probability Calculation in a Transformer Decoder
In a standard Transformer decoder architecture, the probability distribution for the next token is computed in two steps. First, the decoder model ($\mathrm{Dec}$) processes the concatenation of the input sequence $x_1 \dots x_m$ and the previously generated output tokens $y_1 \dots y_{i-1}$ to produce a final sequence of representations $\mathbf{h}_1 \dots \mathbf{h}_{m+i-1}$. Second, the representation at the current position is multiplied by an output projection matrix $\mathbf{W}_o$ and passed through a Softmax function to yield the probability distribution over the next token. The formulas are:

$$\mathbf{h}_1 \dots \mathbf{h}_{m+i-1} = \mathrm{Dec}(x_1 \dots x_m \, y_1 \dots y_{i-1})$$

$$\Pr(\cdot \mid x_1 \dots x_m, y_1 \dots y_{i-1}) = \mathrm{Softmax}(\mathbf{h}_{m+i-1} \mathbf{W}_o)$$

The subscript $m+i-1$ indicates that the calculation is performed at the current decoding step, after processing the $m$ input tokens and the $i-1$ previously generated output tokens.
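To make the two steps concrete, here is a minimal PyTorch sketch. It is an illustration under assumed names and sizes: `d_model`, `vocab_size`, and the random `hidden_states` stand in for a real decoder's output, and `W_o` plays the role of the output projection from the formulas above.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes (assumptions, not from the text above).
d_model, vocab_size = 512, 32_000
m, i = 6, 3  # m input tokens; i-1 tokens generated so far

# Output projection matrix W_o: maps a d_model-dim hidden state to vocabulary logits.
W_o = torch.randn(d_model, vocab_size)

# Step 1: Dec(x_1 .. x_m  y_1 .. y_{i-1}) -> h_1 .. h_{m+i-1}.
# A real model would run its decoder stack here; random hidden states
# of the right shape stand in for the sketch.
hidden_states = torch.randn(m + i - 1, d_model)  # rows are h_1 .. h_{m+i-1}

# Step 2: project the current representation h_{m+i-1} and apply Softmax.
h_current = hidden_states[-1]       # h_{m+i-1}
logits = h_current @ W_o            # shape: (vocab_size,)
probs = F.softmax(logits, dim=-1)   # Pr(. | x_1..x_m, y_1..y_{i-1})

# probs is a distribution over the vocabulary; sampling from it (or taking
# its argmax) selects the next token y_i.
next_token = torch.argmax(probs).item()
```

In practice, implementations often fold the projection and Softmax into a single output layer, and some models tie $\mathbf{W}_o$ to the input embedding matrix, but the order hidden state, then projection, then Softmax is the same.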

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
The Search Problem in LLM Inference
Next-Token Probability Calculation in a Transformer Decoder
In an autoregressive language model, after processing a sequence of input tokens, a corresponding sequence of hidden state vectors is produced by the final decoder layer. To predict the probability distribution for the single token that will come next, what is the correct procedure and why?
An autoregressive model generates text one token at a time. Arrange the following computational steps in the correct order to calculate the probability distribution for the very next token, given the current sequence of tokens.
Debugging a Language Model's Output Distribution
Layer-wise Processing in Transformer Inference
Formula for KV Cache Prefilling
A researcher is building a sequence processing model and describes one of its core layers: the layer first applies a self-attention mechanism to its input sequence and then applies the same two-layer neural network independently at each position. Based on this description, which statement accurately identifies a potential flaw or misunderstanding in the researcher's design compared to a standard Transformer decoding network layer?
A single token's data is being processed by a standard Transformer decoding network. Arrange the following operations in the correct sequence as the data flows through the network's core components, starting from the initial input.
Diagnosing a Faulty Decoding Network
Match each core component of a Transformer decoding network to its primary function within the network's architecture.
Learn After
A developer is implementing a text-generation model. During the decoding process for each new token, their model first computes a final hidden state vector from the decoder. They then immediately apply a Softmax function to this hidden state vector to get a probability distribution for the next token. Which statement best analyzes the flaw in this approach?
A Transformer-based language model has a final hidden state dimension of 768 for each token position. The model's vocabulary consists of 50,000 unique tokens. To compute the probability distribution for the next token, the final hidden state vector is multiplied by an output projection matrix before the Softmax function is applied. What must be the dimensions of this output projection matrix? (See the worked sketch after these questions.)
A Transformer decoder is generating the next token in a sequence. Arrange the following computational steps in the correct order to produce the final probability distribution over the vocabulary.
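As a worked check for the projection-matrix question above: with the row-vector convention $\mathbf{h}\mathbf{W}_o$ used in the formulas at the top, $\mathbf{W}_o$ must map a 768-dimensional hidden state to 50,000 logits, so its shape is $768 \times 50{,}000$ (with the column-vector convention $\mathbf{W}_o\mathbf{h}$ it would be the transpose, $50{,}000 \times 768$). A minimal sketch:

```python
import torch

h = torch.randn(768)             # final hidden state at the current position
W_o = torch.randn(768, 50_000)   # output projection: d_model x vocab_size
logits = h @ W_o                 # one logit per vocabulary token
probs = torch.softmax(logits, dim=-1)

assert probs.shape == (50_000,)  # a distribution over the 50,000-token vocabulary
```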