Multiple Choice

A team trains a reward model using a pointwise method where human annotators assign an absolute quality score from 1 to 10 to each generated text. The team finds that the final language model, trained using this reward model, performs poorly on prompts that differ even slightly from the training data. Which statement best analyzes the fundamental reason for this poor generalization?
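To make the setup concrete, here is a minimal, hypothetical sketch (toy features and synthetic scores, not any real pipeline) of pointwise reward modeling: the model is fit by MSE regression directly onto the absolute 1-10 annotator scores, so it must learn the annotators' raw scale rather than only relative preferences.

```python
import numpy as np

# Toy sketch of a pointwise reward model: regress absolute human scores.
# X stands in for encoded responses; real systems would use an LM encoder.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))          # encoded responses (toy features)
scores = rng.uniform(1, 10, size=100)  # absolute annotator scores in [1, 10]

w = np.zeros(8)
b = 0.0
lr = 0.05
for _ in range(500):
    pred = X @ w + b
    err = pred - scores                # MSE objective: fit the raw score scale
    w -= lr * (X.T @ err) / len(scores)
    b -= lr * err.mean()

mse = np.mean((X @ w + b - scores) ** 2)
```

Because the objective ties the model to the absolute scale used on the training distribution, small shifts in prompt distribution can move predictions off that scale. A pairwise objective (e.g. Bradley-Terry over preference comparisons) learns only orderings between responses, which is one reason it is often preferred for generalization.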

Updated 2025-10-02

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science