Learn Before
Next Token Prediction Formula
The final step in autoregressive generation is to select the next token by finding the one with the highest probability. This is formally expressed using the argmax function: ŷ = argmax_y Pr(y | cache). Here, ŷ is the predicted token, and y is any token in the vocabulary. The notation Pr(y | cache) is used, instead of conditioning on the input sequence itself, to explicitly highlight that the decoding process relies directly on the context stored in the Key-Value (KV) cache, rather than reprocessing the original input at each step.
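To make the role of the cache concrete, here is a minimal sketch of one decoding step for a toy single-head attention layer. It is an illustration under stated assumptions rather than how any particular model is implemented: the helper `decode_step`, the identity projections (q = k = v = x), the two-dimensional embeddings, and the three-token vocabulary are all hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decode_step(x, cache, vocab_embeddings):
    """One toy decoding step: update the cache, attend over it, take the argmax."""
    # Assumption: identity projections, so the new query, key and value all equal x.
    q, k, v = x, x, x
    cache["keys"].append(k)
    cache["values"].append(v)
    # The query attends over every cached key, including the one just appended;
    # earlier tokens are not reprocessed.
    scale = math.sqrt(len(q))
    weights = softmax([dot(q, key) / scale for key in cache["keys"]])
    context = [
        sum(w * val[i] for w, val in zip(weights, cache["values"]))
        for i in range(len(x))
    ]
    # Pr(y | cache) for each vocabulary token y (toy output layer).
    tokens = list(vocab_embeddings)
    probs = softmax([dot(context, vocab_embeddings[t]) for t in tokens])
    # y_hat = argmax_y Pr(y | cache)
    best = max(range(len(tokens)), key=lambda i: probs[i])
    return tokens[best], dict(zip(tokens, probs))

# Hypothetical toy data: one token already cached, one new token arriving.
vocab = {"mat": [1.0, 0.0], "floor": [0.0, 1.0], "sky": [-1.0, -1.0]}
cache = {"keys": [[0.5, 0.1]], "values": [[0.5, 0.1]]}
y_hat, probs = decode_step([1.0, 0.2], cache, vocab)
print(y_hat, len(cache["keys"]))
```

The point of the sketch is that `decode_step` only ever reads the cache plus the newest token's vectors, which is what the notation Pr(y | cache) is meant to emphasize.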

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Next Token Prediction Formula
An autoregressive model is generating the 11th token of a sequence. The Key-Value (KV) Cache has already been populated with the key and value vectors for the first 10 tokens. For this 11th generation step, new query (q_11), key (k_11), and value (v_11) vectors are computed. Which of the following accurately describes the set of key vectors that the new query (q_11) will perform its attention operation over to produce the output for this step?
You are observing a single step of autoregressive generation in a transformer model, specifically for the token at position i. Arrange the following computational events in the correct chronological order for this single step.
Formula for Cache State Evolution during Autoregressive Decoding
Analyzing a Flawed KV Cache Implementation
Learn After
An autoregressive language model is generating text and needs to determine the next token. After processing the existing context (which is stored in its cache), the model's final layer outputs the following probabilities for a small subset of its vocabulary:
- P('mat' | cache) = 0.65
- P('floor' | cache) = 0.25
- P('sky' | cache) = 0.05
- P('the' | cache) = 0.05
According to the standard formula for selecting the single most likely next token, which token will be chosen?
Rationale for Conditioning on the Cache
Explaining the Token Selection Process
The formula ŷ = argmax_y Pr(y|cache) implies that the selection of the next token is a deterministic process, where the single token with the highest calculated probability is always chosen.
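As a small worked sketch of the selection rule ŷ = argmax_y Pr(y|cache), the snippet below applies it to the four example probabilities listed earlier in this section. The helper name `greedy_select` is hypothetical, and the dictionary simply stands in for the model's output distribution over a subset of the vocabulary.

```python
def greedy_select(probs):
    """Return the token with the highest probability (argmax over the dict)."""
    return max(probs, key=probs.get)

# Probabilities from the example above, conditioned on the cache.
probs = {"mat": 0.65, "floor": 0.25, "sky": 0.05, "the": 0.05}

chosen = greedy_select(probs)
print(chosen)  # mat

# No randomness is involved: the same distribution always yields the same token.
repeats = {greedy_select(probs) for _ in range(5)}
print(repeats)  # {'mat'}
```

Because the rule involves no sampling, repeated calls on the same distribution return the same token. Note that 'sky' and 'the' are tied at 0.05, but a tie among low-probability tokens does not affect the result.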