Learn Before
Rationale for Conditioning on the Cache
In the formula for selecting the next token, ŷ = argmax_y Pr(y|cache), the probability is conditioned on the 'cache' rather than the original input sequence. Explain the primary reason for this distinction and what it implies about the efficiency of the generation process.
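A minimal sketch of the efficiency point behind conditioning on the cache (this counting model is an illustration, not from the source): with a KV cache, the prefix is encoded once and each new token costs one forward step; without it, the model re-encodes the entire growing sequence at every step.

```python
# Toy cost model (assumed for illustration): count per-token forward steps.

def steps_without_cache(prompt_len, new_tokens):
    # Each generated token forces a full pass over the growing sequence.
    return sum(prompt_len + i for i in range(1, new_tokens + 1))

def steps_with_cache(prompt_len, new_tokens):
    # The prefix is encoded once; each new token costs a single step.
    return prompt_len + new_tokens

print(steps_without_cache(100, 10))  # → 1055 (roughly quadratic in length)
print(steps_with_cache(100, 10))     # → 110  (linear)
```

This is why Pr(y|cache) and Pr(y|input sequence) describe the same distribution but very different amounts of computation: the cache stores the already-processed context so it never has to be re-derived.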
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An autoregressive language model is generating text and needs to determine the next token. After processing the existing context (which is stored in its cache), the model's final layer outputs the following probabilities for a small subset of its vocabulary:
- P('mat' | cache) = 0.65
- P('floor' | cache) = 0.25
- P('sky' | cache) = 0.05
- P('the' | cache) = 0.05
According to the standard formula for selecting the single most likely next token, which token will be chosen?
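The selection rule above reduces to taking the key with the maximum probability. A short sketch using the probabilities listed in the question (the token names and values are copied from the list above):

```python
# Apply ŷ = argmax_y Pr(y | cache) to the distribution from the question.
probs = {"mat": 0.65, "floor": 0.25, "sky": 0.05, "the": 0.05}

y_hat = max(probs, key=probs.get)  # token with the highest probability
print(y_hat)  # → mat
```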
Rationale for Conditioning on the Cache
Explaining the Token Selection Process
The formula ŷ = argmax_y Pr(y|cache) implies that the selection of the next token is a deterministic process: the single token with the highest calculated probability is always chosen.
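A brief sketch of the determinism claim (the distribution is hypothetical): repeated argmax over the same distribution always returns the same token, whereas sampling from the distribution can vary between calls.

```python
import random

probs = {"mat": 0.65, "floor": 0.25, "sky": 0.05, "the": 0.05}

# Greedy (argmax) decoding is deterministic: same distribution, same pick.
greedy_picks = {max(probs, key=probs.get) for _ in range(5)}
print(greedy_picks)  # → {'mat'}

# Sampling, by contrast, may return different tokens on different calls.
rng = random.Random(0)
sampled = [rng.choices(list(probs), weights=probs.values())[0] for _ in range(5)]
print(sampled)
```

This is the practical distinction between greedy decoding and sampling-based decoding strategies: only the former is fully determined by the cache's output distribution.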