Learn Before
Reward Model Selection in BoN Sampling
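After generating a set of N-best candidate outputs in Best-of-N (BoN) sampling, a reward model evaluates each candidate to perform the final selection. The model calculates a reward score for each input-output pair and selects the candidate with the highest score, formally defined as:
$$\hat{y} = \arg\max_{y \in \{\hat{y}_1, \ldots, \hat{y}_N\}} r(x, y)$$
where $x$ is the input, $\hat{y}_1, \ldots, \hat{y}_N$ are the N candidate outputs, and $r(x, y)$ is the reward score assigned to the input-output pair.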
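The selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `select_best` and `toy_reward` are hypothetical names, and the toy reward (which simply prefers longer answers) stands in for a trained reward model.

```python
from typing import Callable, List, Tuple


def select_best(
    x: str,
    candidates: List[str],
    reward: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Score every (input, candidate) pair and return the top candidate with its score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scored = [(reward(x, y), y) for y in candidates]
    # max() keeps the first candidate on ties, so selection is deterministic.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def toy_reward(x: str, y: str) -> float:
    """Hypothetical stand-in for a reward model: longer answers score higher."""
    return float(len(y))


candidates = [
    "Paris.",
    "The capital of France is Paris.",
    "I am not sure.",
]
best, score = select_best("What is the capital of France?", candidates, toy_reward)
print(best, score)
```

In a real system the reward callable would wrap a learned model; the selection logic itself stays the same because it only depends on the ranking of the scores.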
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models
Related
Input and Output Formulation in BoN Sampling
Generating N-Best Candidates in BoN Sampling
Reward Model Selection in BoN Sampling
Rejection Sampling for LLM Fine-Tuning
A company wants to improve the safety and helpfulness of its AI assistant without the high cost and time of retraining the entire base model. They propose a new system for handling user queries: for each query, the system will first generate 10 different potential responses. Then, a separate, fast-acting 'quality-scoring' model will evaluate all 10 responses based on pre-defined criteria. Finally, the system will present only the single response that received the highest score to the user. What is the most significant trade-off of this approach compared to simply using the first response the base model generates?
A system is designed to improve the quality of its generated text by producing multiple options and then picking the best one. Arrange the following steps of this process in the correct logical order.
Chatbot Response Quality Improvement
Learn After
Best Candidate Selection via Maximum Reward Score in BoN Sampling
An AI system generates four possible summaries for a user's request. A scoring mechanism then evaluates each summary for quality, assigning a numerical score where higher is better. Based on the scores below, which summary would be selected as the final output?
- Summary A: Score 0.85
- Summary B: Score -0.20
- Summary C: Score 1.50
- Summary D: Score 1.15
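As a quick check of the scores listed above, the selection reduces to taking the key with the largest value. The dictionary below is only a transcription of the four scores; the variable names are illustrative.

```python
# Scores for the four candidate summaries, copied from the list above.
scores = {"A": 0.85, "B": -0.20, "C": 1.50, "D": 1.15}

# Pick the summary whose score is highest.
selected = max(scores, key=scores.get)
print(selected, scores[selected])
```

Only the ordering of the scores matters here, so negative or unnormalized values such as -0.20 need no special handling.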
An AI system is designed to generate helpful and safe responses. For a given prompt, it first creates three distinct candidate responses. A secondary component then scores each candidate for helpfulness and safety, and the response with the highest score is selected as the final output. If the system ultimately produces a response that is factually incorrect and unhelpful, which of the following is the most likely point of failure in the process?
Consider a system that first generates a diverse set of potential answers to a prompt and then uses a separate scoring component to select the single best answer to show the user. In this system, the quality of the final, user-facing answer is determined exclusively by the quality of the initial set of potential answers.