1Cademy - Mathematical Representation of the Top-p Candidate Pool

Mathematical Representation of the Top-p Candidate Pool

In top-p (nucleus) sampling, the candidate pool at a given step $i$, denoted as $\overline{V}_i$, is composed of the $k_p$ most probable tokens. The value of $k_p$ is the size of the smallest set of top-ranked tokens whose cumulative probability meets or exceeds the threshold $p$. The pool is formally represented as the set of these top $k_p$ tokens: $\overline{V}_i = \{y_i^{\text{top}1}, \dots, y_i^{\text{top}k_p}\}$.
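The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `top_p_pool`, the dictionary input format, and the example tokens and probabilities are all hypothetical choices made for this sketch.

```python
def top_p_pool(probs, p):
    """Return the top-p candidate pool for one generation step.

    probs: dict mapping token -> probability (assumed to sum to 1).
    p:     cumulative-probability threshold, assumed to lie in (0, 1].
    The result is the smallest prefix of the probability-ranked tokens
    whose cumulative probability meets or exceeds p; its length is k_p.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")

    # Rank tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

    pool, cumulative = [], 0.0
    for token, prob in ranked:
        pool.append(token)
        cumulative += prob
        if cumulative >= p:  # threshold reached: stop growing the pool
            break
    return pool


# Hypothetical next-token distribution, chosen only for illustration.
probs = {"the": 0.5, "a": 0.25, "an": 0.125, "this": 0.125}
pool = top_p_pool(probs, p=0.75)
print(pool, len(pool))  # ['the', 'a'] 2
```

Here the pool stops at two tokens because 0.5 + 0.25 already reaches the threshold of 0.75, so $k_p = 2$. With real model outputs, floating-point rounding can matter when the cumulative sum lands very close to $p$, so a practical implementation may want a small tolerance.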

Updated 2026-07-04

References

Reference of Foundations of Large Language Models Course

Learn After

At a specific step 'i' in a text generation process, the model has calculated the following probabilities for the next token from a vocabulary of {A, B, C, D, E}:

P(A) = 0.40, P(B) = 0.30, P(C) = 0.15, P(D) = 0.10, P(E) = 0.05

If the sampling process uses a probability threshold 'p' of 0.8, which of the following sets correctly represents the candidate pool of tokens, denoted as $\overline{V}_i$?
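
The question above can be worked through directly from the definition of $k_p$. Since the tokens are already listed in descending order of probability, one way to proceed is to accumulate the probabilities until the threshold 0.8 is first reached:

```latex
\begin{align*}
k = 1:&\quad P(A) = 0.40 < 0.8 \\
k = 2:&\quad P(A) + P(B) = 0.40 + 0.30 = 0.70 < 0.8 \\
k = 3:&\quad P(A) + P(B) + P(C) = 0.70 + 0.15 = 0.85 \ge 0.8
\end{align*}
% The threshold is first met at k = 3, so
k_p = 3, \qquad \overline{V}_i = \{A, B, C\}
```

Tokens D and E are excluded because the cumulative probability has already met the threshold after C is added.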

Constructing the Top-p Candidate Pool

A language model's output probabilities for the next token are sorted in descending order. The candidate pool for sampling, represented as $\overline{V}_i = \{y_i^{\text{top}1}, \dots, y_i^{\text{top}k_p}\}$, is constructed by adding tokens in that order until their cumulative probability meets or exceeds the sampling threshold $p$. It is not built by keeping every token whose individual probability is greater than $p$.
