Concept

Continuous Supervision from the RLHF Reward Model

A key advantage of using the reward model for scoring is that it provides a continuous supervision signal. Unlike a binary "good" or "bad" classification, the scalar score offers a nuanced gradient of quality, which makes it far more effective for training other models, such as the policy model in the subsequent RL stage.
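To make the contrast concrete, below is a minimal PyTorch sketch of such a scalar reward head. All details here are illustrative assumptions rather than specifics from the course: the `RewardHead` class name, the hidden size of 768, and the choice to score the final token's hidden state.

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Maps a response's final hidden state to a single scalar score.

    Note there is no sigmoid or argmax at the output: the reward is an
    unbounded scalar, so two responses can be compared by degree rather
    than collapsed into a hard "good"/"bad" label.
    """
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # last_hidden_state: (batch, seq_len, hidden_size), e.g. the
        # output of a transformer backbone (assumed, not shown here).
        final_token = last_hidden_state[:, -1, :]   # (batch, hidden_size)
        return self.score(final_token).squeeze(-1)  # (batch,) scalar rewards

# Usage: two candidate responses receive graded, comparable scores.
head = RewardHead(hidden_size=768)
hidden = torch.randn(2, 16, 768)  # stand-in for transformer hidden states
rewards = head(hidden)            # e.g. tensor([ 0.31, -1.24])
```

Because the output is a real-valued scalar, the downstream RL stage can treat better-than-expected and worse-than-expected responses proportionally, which a binary label cannot express.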
