Activity (Process)

Reward Model Selection in BoN Sampling

After generating a set of N candidate outputs in Best-of-N (BoN) sampling, a reward model scores each candidate to make the final selection. The model computes a reward score for each input-output pair and selects the candidate with the highest score, formally defined as:

$$\hat{\mathbf{y}}_{\mathrm{best}} = \underset{\hat{\mathbf{y}} \in \{\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_N\}}{\arg\max}\; r(\mathbf{x}, \hat{\mathbf{y}})$$
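The selection step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `select_best` and `toy_reward` are hypothetical names, and `toy_reward` is a word-overlap stand-in for a trained reward model r(x, y), used only so the example runs without any model.

```python
from typing import Callable, Sequence, Tuple


def select_best(
    x: str,
    candidates: Sequence[str],
    reward_fn: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Return the candidate with the highest reward, together with its score."""
    if not candidates:
        raise ValueError("BoN selection needs at least one candidate")
    # Score every input-output pair (x, y_i) with the reward function.
    scores = [reward_fn(x, y) for y in candidates]
    # Index of the highest score; on ties, max() keeps the first one it sees.
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]


def toy_reward(x: str, y: str) -> float:
    """Hypothetical stand-in for r(x, y): counts candidate words found in the prompt."""
    prompt_words = set(x.lower().split())
    return float(sum(1 for word in y.lower().split() if word in prompt_words))


if __name__ == "__main__":
    prompt = "why is the sky blue"
    outputs = ["blue", "the sky is blue", "sky"]
    best, score = select_best(prompt, outputs, toy_reward)
    print(best, score)
```

Note that the function returns the candidate itself rather than only its score, which is what the selection requires. In practice `reward_fn` would wrap a forward pass of a trained reward model, and the N scoring calls are usually batched.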

Updated 2026-05-03

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Ch.4 Alignment - Foundations of Large Language Models