Formula

Output Probability Calculation in Transformer Language Models

In a Transformer language model comprising $L$ stacked blocks, the probability distribution for the next token is generated by applying a Softmax layer to the output of the final block. This involves multiplying the final block's output, $\mathbf{H}^{L}$, by a parameter weight matrix $\mathbf{W}^{o} \in \mathbb{R}^{d \times |V|}$. This operation produces a sequence of probability distributions over the vocabulary, each representing the conditional probability of the next token given the preceding tokens:

$$
\begin{bmatrix} \Pr(\cdot \mid x_0, \dots, x_{m-1}) \\ \vdots \\ \Pr(\cdot \mid x_0, x_1) \\ \Pr(\cdot \mid x_0) \end{bmatrix} = \mathrm{Softmax}(\mathbf{H}^{L} \mathbf{W}^{o})
$$
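The sketch below shows this computation in NumPy, assuming a hidden-state matrix `H_L` of shape $(m, d)$ and an output matrix `W_o` of shape $(d, |V|)$; the function name and the random toy data are illustrative, not from the source.

```python
import numpy as np

def next_token_distributions(H_L: np.ndarray, W_o: np.ndarray) -> np.ndarray:
    """Map final-block hidden states to next-token distributions.

    H_L: (m, d) -- one hidden vector per position in the sequence.
    W_o: (d, V) -- output projection onto the vocabulary.
    Returns: (m, V) -- row i is Pr(. | x_0, ..., x_i), i.e. the
    formula's column vector read from bottom to top.
    """
    logits = H_L @ W_o                             # (m, V) unnormalized scores
    logits -= logits.max(axis=-1, keepdims=True)   # stabilize the exponentials
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)   # row-wise Softmax

# Toy usage with random weights (illustrative only).
rng = np.random.default_rng(0)
m, d, V = 4, 8, 16          # sequence length, model width, vocabulary size
H_L = rng.normal(size=(m, d))
W_o = rng.normal(size=(d, V))
P = next_token_distributions(H_L, W_o)
assert np.allclose(P.sum(axis=-1), 1.0)   # each row is a valid distribution
```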

