Formula

Next Token Prediction Formula

The final step in autoregressive generation is to select the next token by finding the one with the highest probability. This is formally expressed using the argmax function:

$$\hat{\mathbf{y}} = \underset{\mathbf{y}}{\arg\max}\, \text{Pr}(\mathbf{y}|\text{cache})$$

Here, $\hat{\mathbf{y}}$ is the predicted token, and $\mathbf{y}$ is any token in the vocabulary. The notation $\text{Pr}(\mathbf{y}|\text{cache})$ is used instead of $\text{Pr}(\mathbf{y}|\mathbf{x})$ to explicitly highlight that the decoding process relies directly on the context stored in the Key-Value (KV) cache, rather than reprocessing the original input $\mathbf{x}$ at each step.
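The argmax selection can be illustrated with a minimal sketch. It assumes the model has already produced one raw score (logit) per vocabulary token for the current step. The vocabulary, the logit values, and the helper names `softmax` and `predict_next_token` are hypothetical and chosen only for illustration; in a real decoder the logits would come from a forward pass that reads the KV cache.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_next_token(logits, vocab):
    """Return the token y that maximizes Pr(y | cache), with its probability."""
    probs = softmax(logits)
    # argmax: index of the largest probability
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocab[best], probs[best]

# Hypothetical 4-token vocabulary and logits, standing in for the model's
# output at the current decoding step.
vocab = ["the", "cat", "sat", "mat"]
logits = [1.0, 3.0, 0.5, 2.0]

token, prob = predict_next_token(logits, vocab)
print(token)  # prints "cat", the token with the largest logit
```

Because softmax preserves the ordering of the scores, taking the argmax over the logits directly would select the same token; the probabilities are computed here only to mirror the formula.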

Updated 2026-05-02

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences