Learn Before
A development team has a pre-trained language model and wants to fine-tune it to produce responses that are more helpful and safe. Their strategy involves first creating a separate model whose sole job is to score how good a given response is, based on human preferences. Which of the following best describes the data and objective used to train this specific 'scoring' model?
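The "scoring" model the question describes is conventionally called a reward model: it is trained on pairs of responses to the same prompt that human annotators have ranked, with an objective that pushes the preferred response's score above the rejected one's. A minimal sketch of that pairwise (Bradley-Terry) objective, using hypothetical scalar scores rather than a real model:

```python
import math

def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise preference loss for a reward model.

    Maximizes the probability that the human-preferred ("chosen")
    response outscores the rejected one:
        loss = -log(sigmoid(score_chosen - score_rejected))
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical scalar scores assigned to two responses to the same
# prompt, where the annotator preferred the first response.
loss_when_agreeing = reward_model_loss(2.0, -1.0)   # model agrees with label
loss_when_disagreeing = reward_model_loss(-1.0, 2.0)  # model contradicts label

# The loss is small when the model's ranking matches the human label
# and large when it contradicts it.
assert loss_when_agreeing < loss_when_disagreeing
```

The key point for the question: the training data is comparisons (or rankings) of responses, not "correct answer" labels, and the objective is to make the scalar score consistent with those human preferences.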
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
You are tasked with aligning a large language model to better follow human preferences using a reward-based approach. Arrange the following high-level stages of the process into the correct chronological order.
Diagnosing Reward Model Failure
Rating LLM Outputs for Reward Models
Challenges of Rating LLM Outputs