Learn Before

Dual Role of the RLHF Reward Model: Ranking-based Training for Scoring Application

Multiple Choice

A language model is being aligned using feedback from human preferences. A separate model is first trained to distinguish between pairs of model-generated responses, learning to identify the better one in each pair. This model is then used to assign a single numerical value to each new response generated by the language model, guiding its optimization. What is the most significant advantage of this two-stage process?
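The two stages described in the question can be sketched in a few lines. This is a minimal illustration, not the method of any particular system: it assumes a linear reward model over hand-made two-dimensional feature vectors and a Bradley-Terry style pairwise loss, and the names `reward`, `pairwise_loss`, `train`, and the toy data are all hypothetical.

```python
import math

def reward(w, features):
    """Scalar score for one response (assumed linear model, for illustration only)."""
    return sum(wi * xi for wi, xi in zip(w, features))

def pairwise_loss(w, chosen, rejected):
    """Bradley-Terry style loss: -log sigmoid(r(chosen) - r(rejected))."""
    margin = reward(w, chosen) - reward(w, rejected)
    return math.log1p(math.exp(-margin))

def train(pairs, dim, lr=0.5, epochs=200):
    """Stage 1: fit the reward model on (preferred, dispreferred) pairs only."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            # d(loss)/dw = -sigmoid(-margin) * (chosen - rejected)
            coeff = 1.0 / (1.0 + math.exp(margin))
            for i in range(dim):
                w[i] += lr * coeff * (chosen[i] - rejected[i])
    return w

# Hypothetical preference data: each item is (features of better response,
# features of worse response). No absolute scores are ever provided.
pairs = [
    ([1.0, 0.2], [0.1, 0.9]),
    ([0.8, 0.1], [0.3, 0.7]),
    ([0.9, 0.4], [0.2, 0.8]),
]
weights = train(pairs, dim=2)

# Stage 2: the same model assigns one number to a single, unpaired response.
new_response = [0.7, 0.3]
score = reward(weights, new_response)
```

The training signal consists only of relative judgments, yet the trained model returns a single number for any individual response, which is the form an optimizer of the language model would consume.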

Updated 2025-09-26
