1Cademy - Best Candidate Selection via Maximum Reward Score in BoN Sampling



In Best-of-N (BoN) sampling, once a set of $N$ candidate outputs $\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_N$ has been generated for an input $\mathbf{x}$, a reward model $r$ scores each one. The final output, $\hat{\mathbf{y}}_{\text{best}}$, is the candidate that receives the highest score. This selection is commonly written as: $\hat{\mathbf{y}}_{\text{best}} = \max\{r(\mathbf{x}, \hat{\mathbf{y}}_1), \ldots, r(\mathbf{x}, \hat{\mathbf{y}}_N)\}$. Strictly speaking, the max function returns the highest score value rather than a candidate, so this notation is shorthand for selecting the argument (the candidate $\hat{\mathbf{y}}_i$) that attains the maximum score, i.e. $\hat{\mathbf{y}}_{\text{best}} = \arg\max_{\hat{\mathbf{y}} \in \{\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_N\}} r(\mathbf{x}, \hat{\mathbf{y}})$.
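The selection step can be illustrated with a minimal Python sketch. It assumes the $N$ candidates have already been generated and that the reward model is available as a plain function `reward(x, y)`. The names `select_best` and `toy_reward` are hypothetical, and the word-overlap scorer is only a stand-in for a trained reward model.

```python
from typing import Callable, Sequence, Tuple


def select_best(
    x: str,
    candidates: Sequence[str],
    reward: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Return the candidate with the highest reward and its score (an argmax)."""
    if not candidates:
        raise ValueError("BoN selection needs at least one candidate")
    # Score every candidate y_i with r(x, y_i).
    scores = [reward(x, y) for y in candidates]
    # Take the index of the maximum score; on ties, the first such index wins.
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]


def toy_reward(x: str, y: str) -> float:
    """Hypothetical stand-in reward: number of distinct words shared with the input."""
    return float(len(set(x.lower().split()) & set(y.lower().split())))


x = "what is the capital of France"
candidates = [
    "I do not know",
    "Paris is the capital of France",
    "France is a country",
]
best, score = select_best(x, candidates, toy_reward)
print(best, score)
```

Note that `max(scores)` alone would return only the score value, which is the distinction the paragraph above draws between max and argmax; selecting over indices recovers the candidate itself.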


Updated 2026-07-03


References

Reference of Foundations of Large Language Models Course
