Formula

Neural Network-Based Next-Token Probability Distribution

Deep neural networks, such as a parameterized Transformer decoder denoted $\mathrm{Decoder}_{\theta}(\cdot)$, generate a probability distribution for the next token based on a sequence of preceding tokens $x_0, \dots, x_i$. This predicted distribution is written $\mathrm{Pr}_{\theta}(\cdot \mid x_0, \dots, x_i)$, often abbreviated as $\mathbf{p}_{i+1}^{\theta}$. The model's final output at that position is typically the token that receives the maximum probability.
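As a minimal sketch of this step (in Python with NumPy; the `logits` vector is a hypothetical toy example rather than the output of a real decoder), the snippet below converts a decoder's raw vocabulary scores into the distribution $\mathbf{p}_{i+1}^{\theta}$ via softmax and then selects the maximum-probability token:

```python
import numpy as np

def next_token_distribution(logits: np.ndarray) -> np.ndarray:
    """Turn the decoder's raw scores over the vocabulary into the
    distribution p_{i+1} = Pr_theta(. | x_0, ..., x_i) via softmax."""
    z = logits - logits.max()   # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical logits for a 5-token vocabulary. In practice this vector
# would come from Decoder_theta(x_0, ..., x_i) at position i, with one
# score per vocabulary entry.
logits = np.array([2.0, 0.5, -1.0, 3.1, 0.0])
p_next = next_token_distribution(logits)

print(p_next)                 # the predicted distribution p_{i+1}
print(int(p_next.argmax()))   # greedy choice: the maximum-probability token
```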
