1Cademy - Next Token Prediction Formula Using KV Cache

Learn Before

Argmax Formula for Next Token Prediction
Single-Step Autoregressive Generation with a Key-Value (KV) Cache

Formula

Next Token Prediction Formula Using KV Cache

The final step of each autoregressive decoding step is to select the next token by finding the one with the highest probability (greedy selection). This is formally expressed using the argmax function: $\hat{\mathbf{y}} = \underset{\mathbf{y}}{\arg\max}\, \text{Pr}(\mathbf{y}|\text{cache})$. Here, $\hat{\mathbf{y}}$ is the predicted token, and $\mathbf{y}$ ranges over every token in the vocabulary. The notation $\text{Pr}(\mathbf{y}|\text{cache})$ is used instead of $\text{Pr}(\mathbf{y}|\mathbf{x})$ to highlight that decoding relies directly on the context stored in the Key-Value (KV) cache, rather than reprocessing the original input $\mathbf{x}$ at each step.
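The formula can be illustrated with a minimal Python sketch of one decoding step. Everything here is a toy assumption rather than a real model: a single attention head, hand-picked vectors, an identity output projection, and the hypothetical helper names `decode_step` and `greedy_next_token`. The point is only that the new token's key and value are appended to the cache, attention runs over the cached entries, and the argmax is taken over the resulting distribution.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def decode_step(query, new_key, new_value, cache, output_weights):
    """Toy single-head step: extend the cache, attend over it, return Pr(y | cache)."""
    # Append only the newest key/value; earlier positions are reused, not recomputed.
    cache["keys"].append(new_key)
    cache["values"].append(new_value)

    # Scaled dot-product attention of the current query over all cached keys.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in cache["keys"]]
    attn = softmax(scores)

    # Attention-weighted sum of the cached values.
    dim = len(new_value)
    context = [sum(a * value[i] for a, value in zip(attn, cache["values"]))
               for i in range(dim)]

    # Project the context to vocabulary logits (hypothetical weights) and normalize.
    logits = [sum(w * c for w, c in zip(row, context)) for row in output_weights]
    return softmax(logits)

def greedy_next_token(probs, vocab):
    """Return the vocabulary entry with the highest probability (the argmax)."""
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return vocab[best_index]

# Hypothetical three-token vocabulary and identity output projection.
vocab = ["mat", "floor", "sky"]
output_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# The cache already holds one earlier position.
cache = {"keys": [[1.0, 0.0]], "values": [[0.0, 0.0, 1.0]]}

probs = decode_step(query=[0.0, 4.0], new_key=[0.0, 1.0],
                    new_value=[1.0, 0.0, 0.0], cache=cache,
                    output_weights=output_weights)
token = greedy_next_token(probs, vocab)
print(token)  # prints "mat"
```

In this sketch the original input is never revisited: the only state carried between steps is `cache`, which is what the notation $\text{Pr}(\mathbf{y}|\text{cache})$ emphasizes.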

Updated 2026-06-20

References

Reference of Foundations of Large Language Models Course

Learn After

An autoregressive language model is generating text and needs to determine the next token. After processing the existing context (which is stored in its cache), the model's final layer outputs the following probabilities for a small subset of its vocabulary:
- P('mat' | cache) = 0.65
- P('floor' | cache) = 0.25
- P('sky' | cache) = 0.05
- P('the' | cache) = 0.05
According to the standard formula for selecting the single most likely next token, which token will be chosen?

Rationale for Conditioning on the Cache

Explaining the Token Selection Process

The formula $\hat{\mathbf{y}} = \underset{\mathbf{y}}{\arg\max}\, \text{Pr}(\mathbf{y}|\text{cache})$ implies that the selection of the next token is a deterministic process, where the single token with the highest calculated probability is always chosen.
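
As a worked illustration of the selection rule, the short Python sketch below applies an argmax to the four probabilities listed in the exercise above. The function name `argmax_token` is hypothetical, and the tie-breaking behavior (Python's `max` keeps the first maximal entry it encounters) is an assumption of this sketch, not part of the formula.

```python
# Probabilities from the exercise, conditioned on the cache.
probs = {"mat": 0.65, "floor": 0.25, "sky": 0.05, "the": 0.05}

def argmax_token(distribution):
    """Return the token whose probability is highest."""
    return max(distribution, key=distribution.get)

# Repeating the selection never changes the result: argmax involves no sampling.
choices = [argmax_token(probs) for _ in range(5)]
print(choices)  # prints ['mat', 'mat', 'mat', 'mat', 'mat']
```

Because no randomness enters the computation, the same cache contents always yield the same token, which is the sense in which the process is deterministic.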
