Multiple Choice

A team trains a reward model using a pointwise method where human annotators assign an absolute quality score from 1 to 10 to each generated text. The team finds that the final language model, trained using this reward model, performs poorly on prompts that differ even slightly from the training data. Which statement best analyzes the fundamental reason for this poor generalization?
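To make the setup concrete, here is a minimal, hypothetical sketch (toy features and synthetic scores, not any real pipeline) of pointwise reward modeling: the model is fit by MSE regression directly onto the absolute 1-10 annotator scores, so it must learn the annotators' raw scale rather than only relative preferences.

```python
import numpy as np

# Toy sketch of a pointwise reward model: regress absolute human scores.
# X stands in for encoded responses; real systems would use an LM encoder.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))          # encoded responses (toy features)
scores = rng.uniform(1, 10, size=100)  # absolute annotator scores in [1, 10]

w = np.zeros(8)
b = 0.0
lr = 0.05
for _ in range(500):
    pred = X @ w + b
    err = pred - scores                # MSE objective: fit the raw score scale
    w -= lr * (X.T @ err) / len(scores)
    b -= lr * err.mean()

mse = np.mean((X @ w + b - scores) ** 2)
```

Because the objective ties the model to the absolute scale used on the training distribution, small shifts in prompt distribution can move predictions off that scale. A pairwise objective (e.g. Bradley-Terry over preference comparisons) learns only orderings between responses, which is one reason it is often preferred for generalization.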

Updated 2025-10-02

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science