Policy Learning in RLHF
The policy learning stage in RLHF is an iterative fine-tuning process. For each step, a prompt, $x$, is sampled from a dataset, $\mathcal{D}$. The current language model, acting as the policy $\pi_{\theta}$, then generates a corresponding output, $y$, by sampling from its probability distribution, $\pi_{\theta}(y \mid x)$. This input-output pair, $\{x, y\}$, is evaluated by the trained reward model, which assigns it a numerical reward score, $r(x, y)$. This score serves as the feedback signal for a reinforcement learning algorithm, which updates the policy's parameters $\theta$ to favor outputs that receive higher rewards.
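To make the loop concrete, below is a minimal sketch of a single policy-learning step. It substitutes a toy linear "policy" over a small vocabulary and a stand-in reward function for the trained reward model, and uses a simple REINFORCE-style update as a stand-in for the PPO objective typically used in practice (see "Use of Proximal Policy Optimization (PPO) in RLHF" under Related). Every concrete name and dimension here (VOCAB_SIZE, SEQ_LEN, the linear policy, reward_model) is an illustrative assumption, not part of the source.

```python
# Minimal sketch of one policy-learning step in RLHF (REINFORCE-style update
# on a toy policy). All components are illustrative assumptions: a real setup
# would use a transformer policy, a learned reward model, and PPO.

import torch

VOCAB_SIZE = 16  # toy vocabulary size (assumption)
SEQ_LEN = 8      # fixed response length for simplicity (assumption)

# Toy "language model" policy: maps a prompt embedding to token logits.
policy = torch.nn.Linear(4, VOCAB_SIZE)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reward_model(prompt: torch.Tensor, response: torch.Tensor) -> torch.Tensor:
    """Stand-in for the trained reward model r(x, y); here it simply rewards
    responses that use low token ids (purely illustrative)."""
    return -response.float().mean()

def policy_learning_step(prompt: torch.Tensor) -> float:
    # 1. Sample an output y from the current policy pi_theta(y | x).
    logits = policy(prompt)                        # shared per-step logits (toy)
    dist = torch.distributions.Categorical(logits=logits)
    tokens = dist.sample((SEQ_LEN,))               # i.i.d. token draws (toy)
    log_probs = dist.log_prob(tokens)              # log pi_theta for each token

    # 2. Score the pair {x, y} with the reward model: r = r(x, y).
    #    The reward is treated as a constant feedback signal, not backpropagated.
    reward = reward_model(prompt, tokens).detach()

    # 3. Policy-gradient update: raise the log-probability of the sampled
    #    output in proportion to its reward (REINFORCE objective).
    loss = -reward * log_probs.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.item()

# One training step on a random "prompt" standing in for a sample from D.
r = policy_learning_step(torch.randn(4))
print(f"reward for sampled response: {r:.3f}")
```

Note the design point the sketch makes explicit: the reward enters the loss only as a scalar weight on the sampled output's log-probability, so gradients flow through the policy's distribution, never through the reward model itself.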
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Policy Learning in RLHF
Dual Role of the RLHF Reward Model: Ranking-based Training for Scoring Application
Relation between Verifiers and RLHF Reward Models
General Loss Minimization Objective for Reward Model Training
Architecture and Function of the RLHF Reward Model
Reward Model Training as a Ranking Problem in RLHF
Underdetermined Model
Limitations of Outcome-Based Rewards for Entire Sequences
Training a Reward Model with Preference Data
Converting Listwise Rankings to Pairwise Preferences for Reward Model Training
Diagnosing Undesired Model Behavior
An AI team is training a reward model using a dataset where, for each prompt, human annotators have ranked several generated responses from best to worst. What is the fundamental task the reward model is being trained to perform based on this specific type of data?
An AI development team is training a model to act as a helpful assistant. They create a dataset where, for each user prompt, human evaluators are shown two different generated responses and asked to choose which one is better. The model is then trained on this dataset of pairwise preferences. After training, the team observes that the model consistently assigns higher scores to longer, more detailed responses, even when they are less helpful or contain irrelevant information. Which of the following is the most likely explanation for this emergent behavior?
Ranking LLM Outputs as an Alternative to Rating
Regularization in RLHF Reward Model Training
Complexity of Reward Model Training in RLHF
Historical Development of RLHF
Justification for Using RLHF over Supervised Learning
Bridging Language Modeling and Reinforcement Learning Notations in RLHF
Architectural Components of an RLHF System
Three-Stage Training Process of RLHF
Refinements and Alternatives to RLHF
Rationale for End-of-Sequence Rewards in RLHF
High-Level Process of RLHF with PPO
Limitations of Human Feedback in LLM Alignment
Computational and Stability Challenges of RLHF
Goal of RLHF
Origin and Application of RLHF
Dual Learning Tasks of RLHF: Reward and Policy Learning
Four-Stage Process of Reinforcement Learning from Human Feedback (RLHF)
RLHF Training Process with PPO
An AI development team is considering two different methods for training a conversational assistant to be more helpful and aligned with user expectations. Method 1 involves having human experts write a large dataset of ideal, high-quality responses to various prompts, and then training the AI to imitate these examples. Method 2 involves having the AI generate several responses to each prompt, and then asking human experts to simply rank these responses from best to worst. This ranking data is then used to train a separate 'preference model' that provides a reward signal to guide the AI's learning process. Which statement best analyzes the primary advantage of Method 2 over Method 1?
LLM as the Agent in RLHF
Reward Model as an Environment Proxy in RLHF
A team is using human feedback to improve a language model's ability to follow instructions safely and helpfully. Arrange the following high-level stages of this process into the correct chronological order.
RLHF Objective Function
Comparison of Objectives: Supervised Fine-Tuning vs. RLHF
Evaluating a Training Method for a High-Stakes Application
Diagnosing Instability in an RLHF + PPO Training Run
Choosing and Justifying an RLHF Objective Under Competing Product Constraints
Interpreting Conflicting RLHF Signals: Reward Model Ranking vs. PPO Updates Under KL Regularization
Root-Cause Analysis of a “Reward Hacking” Spike During RLHF with PPO
Tuning an RLHF + PPO Update When Reward Improves but Behavior Regresses
Post-Deployment Drift After RLHF: Diagnosing Reward Model and PPO/KL Interactions
Designing an RLHF Training Blueprint for a Regulated Customer-Support LLM
You’re running an RLHF fine-tuning job for an inte...
You are reviewing an RLHF training run for an inte...
Your team is running RLHF for a customer-facing LL...
Formulating the Loss Function for Policy Learning in RLHF
A team is refining a language model using a method where, for each training step, a prompt is selected and the model itself generates a response. This prompt-response pair is then used as part of the input for that training step's update calculation. Based on this description, what is the most accurate analysis of the function of the model-generated response in this specific training phase?
Comparing Data Sourcing Strategies
Contrasting Data Sourcing Methods in Model Training
Optimal Parameters Formula in RL Fine-Tuning
Learn After
Objective Function for Policy Learning in RLHF
Use of Proximal Policy Optimization (PPO) in RLHF
Application of A2C in RLHF for LLM Alignment
Role and Definition of the Reference Model in RLHF
Joint Optimization of Policy and Value Functions in RLHF
RLHF Policy Optimization Objective
Reference Policy in RLHF
RLHF Policy Optimization as Loss Minimization
A language model is being fine-tuned using an iterative feedback process. In each step, the model generates a response to a prompt. A separate, pre-trained scoring model then assigns a numerical score to this response based on its quality. What is the most direct and immediate use of this numerical score within a single step of this training loop?
Arrange the following events into the correct chronological order as they would occur within a single iterative step of the policy learning phase for a language model.
Diagnosing a Training Failure in an Iterative Fine-Tuning Process
Direct Preference Optimization (DPO)