Learn Before
Explaining the Token Selection Process
An autoregressive model has just processed a sequence of text and computed a probability for every token in its vocabulary for the next position. In the context of the standard formula for this process, describe the specific mathematical operation used to select the single most likely next token from this set of probabilities.
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Comprehension in Revised Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An autoregressive language model is generating text and needs to determine the next token. After processing the existing context (which is stored in its cache), the model's final layer outputs the following probabilities for a small subset of its vocabulary:
- P('mat' | cache) = 0.65
- P('floor' | cache) = 0.25
- P('sky' | cache) = 0.05
- P('the' | cache) = 0.05
According to the standard formula for selecting the single most likely next token, which token will be chosen?
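The selection described above can be sketched in a few lines of Python: the argmax over the given probabilities picks the highest-probability token. The `probs` dictionary below just restates the values from the question.

```python
# A minimal sketch of greedy token selection over the probabilities above.
probs = {
    "mat": 0.65,
    "floor": 0.25,
    "sky": 0.05,
    "the": 0.05,
}

# y_hat = argmax_y Pr(y | cache): choose the key with the highest probability.
y_hat = max(probs, key=probs.get)
print(y_hat)  # → mat
```

Because 0.65 is the largest probability, 'mat' is selected.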
Rationale for Conditioning on the Cache
Explaining the Token Selection Process
The formula
ŷ = argmax_y Pr(y | cache)
implies that the selection of the next token is a deterministic process, where the single token with the highest calculated probability is always chosen.
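The determinism this formula implies can be illustrated with a small Python sketch: repeated argmax selections over the same probability distribution always return the same token (the example distribution is assumed for illustration).

```python
# Sketch: argmax selection is deterministic — given identical
# probabilities, it always returns the identical token.
probs = {"mat": 0.65, "floor": 0.25, "sky": 0.05, "the": 0.05}

# Select the next token five times from the same distribution.
selections = [max(probs, key=probs.get) for _ in range(5)]

# Every selection is the same token; no randomness is involved.
assert len(set(selections)) == 1
print(selections[0])  # → mat
```

Sampling-based decoding strategies (e.g. temperature sampling) relax this determinism, but the argmax formula itself always yields a single fixed choice.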