Learn Before
Diagnosing Reward Model Failure
An AI development team is training a language model to be a helpful assistant. They first create a dataset in which human labelers compare pairs of model-generated responses to the same prompt and choose the better one. This comparison data is used to train a separate 'scoring' model, whose goal is to predict which response a human would prefer. Finally, the main language model is trained to generate responses that achieve a high score from this scoring model. After training is complete, the team observes that the final language model often produces outputs that are grammatically correct and sound confident, but are factually incorrect or nonsensical. Nevertheless, these incorrect responses consistently receive high scores from the scoring model. What is the most likely flaw in the training of the 'scoring' model that would lead to this specific outcome?
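To make the described pipeline concrete, below is a minimal sketch of the pairwise (Bradley-Terry-style) objective commonly used to train such a 'scoring' (reward) model from human comparison data. It assumes PyTorch; the function name preference_loss and the toy score tensors are illustrative placeholders, not part of the original question.

import torch
import torch.nn.functional as F

def preference_loss(chosen_scores: torch.Tensor,
                    rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss for a reward ('scoring') model.

    Maximizes the probability that the human-preferred response outscores
    the rejected one: loss = -log sigmoid(score_chosen - score_rejected).
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy check with random scalar scores (one comparison pair per batch element).
# In practice each score would come from a reward model applied to a
# (prompt, response) pair.
chosen = torch.randn(8)    # scores for the human-preferred responses
rejected = torch.randn(8)  # scores for the dispreferred responses
print(preference_loss(chosen, rejected).item())

Note that only the difference between the two scores in each compared pair enters the loss, so the scoring model is trained purely to reproduce the labelers' rankings, not to verify anything about the responses directly.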
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A development team has a pre-trained language model and wants to fine-tune it to produce responses that are more helpful and safe. Their strategy involves first creating a separate model whose sole job is to score how good a given response is, based on human preferences. Which of the following best describes the data and objective used to train this specific 'scoring' model?
You are tasked with aligning a large language model to better follow human preferences using a reward-based approach. Arrange the following high-level stages of the process into the correct chronological order.
Rating LLM Outputs for Reward Models
Challenges of Rating LLM Outputs