
Policy Learning in RLHF

The policy learning stage in RLHF is an iterative fine-tuning process. For each step, a prompt, x, is sampled from a dataset, D. The current language model, acting as the policy, then generates a corresponding output, y, by sampling from its probability distribution, Pr(y|x). This input-output pair, {x, y}, is evaluated by the trained reward model, which assigns it a numerical reward score, r(x, y). This score serves as the feedback signal for a reinforcement learning algorithm, which updates the policy's parameters to favor outputs that receive higher rewards.
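As a concrete illustration, here is a minimal sketch of one such update loop using a REINFORCE-style policy gradient. The names `policy_logits`, `reward_model`, and `dataset` are hypothetical stand-ins: a real policy is a full language model conditioned on x, and production RLHF systems typically use more elaborate algorithms such as PPO.

```python
import torch

# Toy stand-ins (assumptions, not the real RLHF components):
# the "policy" is a single categorical distribution over a tiny vocabulary,
# and the "reward model" is a hand-written scoring function.
vocab_size, max_len = 16, 8
policy_logits = torch.zeros(vocab_size, requires_grad=True)
optimizer = torch.optim.Adam([policy_logits], lr=0.1)

def reward_model(x, y):
    # Hypothetical reward r(x, y): prefers outputs containing token 3.
    return float((y == 3).sum())

# Prompts x sampled from a dataset D (random toy prompts here).
dataset = [torch.randint(vocab_size, (4,)) for _ in range(32)]

for step in range(100):
    x = dataset[step % len(dataset)]                         # sample prompt x from D
    probs = torch.softmax(policy_logits, dim=-1)             # Pr(y|x) (toy policy ignores x)
    y = torch.multinomial(probs, max_len, replacement=True)  # sample output y from the policy
    r = reward_model(x, y)                                   # scalar reward r(x, y)
    log_prob = torch.log(probs[y]).sum()                     # log Pr(y|x) of the sampled output
    loss = -r * log_prob                                     # REINFORCE: scale log-prob by reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                         # update policy toward higher-reward outputs
```

Minimizing -r(x, y) · log Pr(y|x) raises the probability of outputs that received high reward, which is exactly the update direction described above.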
