Learn Before
A reward model is trained using a pointwise method where human annotators assign an absolute quality score to each response. The model's high sensitivity to disagreements among annotators is primarily a result of the regression algorithm's inherent difficulty in processing a wide numerical range of scores.
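The pointwise setup described above can be sketched as a plain regression problem. Everything in this sketch is hypothetical (invented feature vectors, scores, and function names; the actual pipeline is not specified in the card), but it makes the disagreement issue concrete: when two annotators give the same response different absolute scores, a mean-squared-error regression can only average them, so annotator noise flows directly into the learned reward.

```python
# Hypothetical sketch of pointwise reward-model training: each response is
# reduced to a feature vector, annotators give it an absolute score, and a
# linear model w.x + b is fit by gradient descent on mean-squared error.
# The feature vectors and scores below are invented for illustration.

def train_pointwise(features, scores, lr=0.1, epochs=2000):
    """Fit a linear reward model by gradient descent on MSE."""
    n_dim = len(features[0])
    w = [0.0] * n_dim
    b = 0.0
    m = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * n_dim
        grad_b = 0.0
        for x, y in zip(features, scores):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for i in range(n_dim):
                grad_w[i] += 2.0 * err * x[i] / m
            grad_b += 2.0 * err / m
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def reward(w, b, x):
    """Score a response's feature vector under the fitted model."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Two annotators score the *same* response differently (6 vs. 9); the
# regression can only split the difference, so annotator disagreement is
# baked directly into the learned reward.
features = [[1.0, 0.5], [1.0, 0.5], [0.2, 0.9]]
scores = [6.0, 9.0, 3.0]
w, b = train_pointwise(features, scores)
print(round(reward(w, b, [1.0, 0.5]), 2))  # ~7.5, midway between 6 and 9
```

Note that the model's difficulty here is not the numerical range of the scores (a regression handles a 1–10 scale without trouble); it is that absolute scores are subjective, so identical inputs carry conflicting targets.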
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Diagnosing Issues in a Chatbot Training Pipeline
A team trains a reward model using a pointwise method where human annotators assign an absolute quality score from 1 to 10 to each generated text. The team finds that the final language model, trained using this reward model, performs poorly on prompts that differ even slightly from the training data. Which statement best analyzes the fundamental reason for this poor generalization?