1Cademy - Reward Model Selection in BoN Sampling

Learn Before

Best-of-N Sampling (BoN Sampling)

Activity (Process)

Reward Model Selection in BoN Sampling

After generating a set of $N$ candidate outputs in Best-of-$N$ (BoN) sampling, a reward model evaluates each candidate to make the final selection. The reward model computes a score $r(\mathbf{x},\hat{\mathbf{y}}_i)$ for each input-output pair and selects the candidate with the highest score, formally defined as:

$\hat{\mathbf{y}}_{\mathrm{best}} = \underset{\hat{\mathbf{y}} \in \{\hat{\mathbf{y}}_1,\ldots,\hat{\mathbf{y}}_N\}}{\arg\max}\; r(\mathbf{x},\hat{\mathbf{y}})$
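The selection step can be illustrated with a minimal Python sketch. It assumes the $N$ candidates have already been generated. The names `select_best` and `toy_reward` are hypothetical, and `toy_reward` is only a word-overlap stand-in for a trained reward model $r(\mathbf{x},\hat{\mathbf{y}})$, not a real scoring method.

```python
from typing import Callable, Sequence, Tuple


def select_best(
    x: str,
    candidates: Sequence[str],
    reward_fn: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Return the candidate with the highest reward score, plus that score."""
    if not candidates:
        raise ValueError("BoN selection needs at least one candidate")
    # Score every input-output pair (x, y_i) with the reward function.
    scores = [reward_fn(x, y) for y in candidates]
    # Pick the index of the maximum score; ties go to the earliest candidate.
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]


def toy_reward(x: str, y: str) -> float:
    """Hypothetical stand-in reward: number of prompt words reused in the output."""
    prompt_words = set(x.lower().split())
    output_words = set(y.lower().replace(".", "").split())
    return float(len(prompt_words & output_words))


prompt = "explain solar energy"
candidates = [
    "Cats are great.",
    "Solar panels exist.",
    "Solar energy comes from sunlight.",
]
best, best_score = select_best(prompt, candidates, toy_reward)
print(best, best_score)
```

Because the choice depends entirely on the scores, a reward function that ranks candidates poorly will return a poor output even when a better candidate is present in the set.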

Updated 2026-05-03

References

Reference of Foundations of Large Language Models Course

Learn After

Best Candidate Selection via Maximum Reward Score in BoN Sampling
An AI system generates four possible summaries for a user's request. A scoring mechanism then evaluates each summary for quality, assigning a numerical score where higher is better. Based on the scores below, which summary would be selected as the final output?
- Summary A: Score 0.85
- Summary B: Score -0.20
- Summary C: Score 1.50
- Summary D: Score 1.15
An AI system is designed to generate helpful and safe responses. For a given prompt, it first creates three distinct candidate responses. A secondary component then scores each candidate for helpfulness and safety, and the response with the highest score is selected as the final output. If the system ultimately produces a response that is factually incorrect and unhelpful, which of the following is the most likely point of failure in the process?
Consider a system that first generates a diverse set of potential answers to a prompt and then uses a separate scoring component to select the single best answer to show the user. In this system, the quality of the final, user-facing answer is determined exclusively by the quality of the initial set of potential answers.
