Learn Before
Example of Top-k Sampling with k=3
This example illustrates the top-k sampling process with k=3. First, five candidate words are ranked by their initial probabilities: 'cute' (Pr=0.34), 'on' (Pr=0.32), 'sick' (Pr=0.21), 'are' (Pr=0.12), and '.' (Pr=0.01). Next, the top k=3 candidates ('cute', 'on', 'sick') are selected, and the rest are pruned. The probabilities of the selected candidates are then renormalized to sum to 1 by dividing each by their combined mass (0.34 + 0.32 + 0.21 = 0.87), yielding new probabilities: 'cute' (Pr≈0.39), 'on' (Pr≈0.37), and 'sick' (Pr≈0.24). Finally, a token is chosen by sampling from this renormalized distribution; in this example, 'on' is selected as the final output.
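The rank-prune-renormalize-sample procedure above can be sketched in Python (a minimal illustration, not any particular library's implementation; the fixed random seed is an assumption added for reproducibility):

```python
import random

def top_k_sample(probs, k, rng):
    """Keep the k highest-probability candidates, renormalize, and sample one."""
    # Rank candidates by probability and keep only the top k.
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    # Renormalize the surviving probabilities so they sum to 1.
    renorm = {w: p / total for w, p in top}
    # Sample one token from the renormalized distribution.
    words, weights = zip(*renorm.items())
    return rng.choices(words, weights=weights, k=1)[0], renorm

probs = {'cute': 0.34, 'on': 0.32, 'sick': 0.21, 'are': 0.12, '.': 0.01}
token, renorm = top_k_sample(probs, k=3, rng=random.Random(0))
print({w: round(p, 2) for w, p in renorm.items()})
# {'cute': 0.39, 'on': 0.37, 'sick': 0.24}
```

Because the final step is a random draw, any of the three surviving tokens can be selected; 'on' is just one possible outcome.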
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Example of Top-k Sampling with k=3
Top-k Selection Pool
Probability Renormalization Formula for Restricted Vocabulary Sampling
Probability Renormalization Formula for Top-k Sampling
A language model is generating the next word in a sequence and has calculated the initial probabilities for the five most likely candidates: 'the' (0.4), 'a' (0.2), 'one' (0.1), 'his' (0.05), and 'her' (0.05). If the model uses a sampling strategy where it only considers the top 3 most likely candidates (k=3), what will be the new, rescaled probability distribution for this reduced set of candidates from which the final word will be sampled?
Arrange the following actions into the correct sequence that describes the process of selecting the next token in a text generation model using the top-k sampling method.
Analyzing Text Generation Outputs
Learn After
A language model is generating the next word in a sequence and has calculated the initial probabilities for six potential words: 'the' (0.40), 'a' (0.25), 'an' (0.15), 'some' (0.10), 'any' (0.05), and 'every' (0.05). The system uses a decoding strategy where it only considers the top 4 most likely candidates for the final selection. After discarding the other candidates, the probabilities of the remaining words are adjusted to sum to 1. What is the adjusted probability for the word 'a'?
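The arithmetic in this question can be checked with a short renormalization sketch (illustrative only; variable names are assumptions, not part of the question):

```python
# Initial probabilities for the six candidate words from the question.
probs = {'the': 0.40, 'a': 0.25, 'an': 0.15, 'some': 0.10,
         'any': 0.05, 'every': 0.05}
# Keep the top 4 candidates and discard the rest.
top4 = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:4])
# Combined mass of the survivors: 0.40 + 0.25 + 0.15 + 0.10 = 0.90.
total = sum(top4.values())
# Rescale so the remaining probabilities sum to 1.
adjusted = {w: p / total for w, p in top4.items()}
print(round(adjusted['a'], 4))  # 0.25 / 0.90 ≈ 0.2778
```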
A text generation model uses a method to select the next word where it only considers a small, fixed number of the most probable options. Arrange the following steps to accurately describe the sequence of this method.
Inferring Decoding Parameters