Neural Network-Based Next-Token Probability Distribution
Deep neural networks, such as a Transformer decoder parameterized by $\theta$, generate a probability distribution over the next token given a sequence of preceding tokens $x_0, x_1, \ldots, x_{i-1}$. This predicted distribution is written $\Pr_\theta(x_i \mid x_0, \ldots, x_{i-1})$, often abbreviated as $\Pr_\theta(\cdot \mid x_{<i})$. The model's final output for that position is typically the token that receives the maximum probability.
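A minimal sketch of this idea: query a causal language model for the next-token distribution and read off the argmax token. It assumes PyTorch, the Hugging Face transformers library, and a GPT-2 checkpoint; none of these are specified in the note and are used here only for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint choice; the note does not name a specific model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = "The cat sat on the"
input_ids = tokenizer(context, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (1, seq_len, vocab_size)

# The next-token distribution Pr(. | x_0, ..., x_{i-1}) is the softmax of the
# logits at the last position of the context.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

# Greedy readout: the token with maximum probability.
top_id = torch.argmax(next_token_probs).item()
print(tokenizer.decode(top_id), next_token_probs[top_id].item())
```

With the 'The cat sat on the' context, the printed token is whichever vocabulary entry the model assigns the highest probability; sampling-based decoders would instead draw from `next_token_probs` rather than taking its argmax.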
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.1 Pre-training - Foundations of Large Language Models
Related
Schematic of Probability Calculation in Causal Language Modeling
An autoregressive language model is given the sequence of tokens: 'The', 'cat', 'sat', 'on', 'the'. It is now tasked with predicting the very next token. Which of the following expressions correctly represents the primary calculation the model performs to determine the likelihood of the word 'mat' appearing next?
Contextual Influence on Token Probability
Analyzing Contextual Influence on Next-Token Probability
You’re reviewing an internal evaluation script tha...
Your team is building an internal tool that ranks ...
You’re reviewing an internal LLM evaluation pipeli...
Reconciling Training Log-Likelihood with Inference-Time Sequence Selection
Explaining a Counterintuitive Decoding Outcome Using Softmax, Next-Token Conditionals, and Sequence Log-Probability
Diagnosing a “High-Confidence Wrong Token” Bug in Autoregressive Scoring
Investigating a Production Scoring Bug: Softmax Normalization vs. Autoregressive Sequence Log-Probability
Design a Correct Sequence-Scoring Function for Autoregressive LLM Outputs
Root-Cause Analysis: Why a “More Likely” Token-by-Token Completion Loses on Total Sequence Score
Auditing a Candidate Completion Using Softmax Next-Token Probabilities and Autoregressive Log-Probability
Initial Token Probability Assumption
Training Decoder-Only Language Models with Cross-Entropy Loss
Output Probability Calculation in Transformer Language Models
Global Nature of Standard Transformer LLMs
Processing Flow of Autoregressive Generation in a Decoder-Only Transformer
Initial Input Representation for Transformer Layers
Greedy Decoding in Language Models
Structure of a Transformer Block
A generative language model is designed to produce text by predicting the next token based solely on the sequence of tokens that came before it. If you were to adapt a standard Transformer decoder block for this specific auto-regressive task, which of its sub-layers would you remove, and why is this modification functionally necessary?
A language model is constructed using a stack of modified Transformer decoder blocks. Each block contains a self-attention sub-layer and a feed-forward network sub-layer, but lacks the sub-layer that would process information from a separate, secondary input sequence. This model is capable of performing a machine translation task, such as translating a German sentence into English, without any further architectural changes.
Function of Self-Attention in Auto-regressive Generation
Learn After
A neural network language model, which has a vocabulary of 50,000 unique tokens, is given the input context 'The sun is shining and the sky is'. What does the model's final layer compute and output directly to represent the likelihood of the next token?
Language Model Output Structure
Interpreting Language Model Output