
Joint Optimization of Policy and Value Functions in RLHF

In the final stage of the RLHF process, the policy and value models are trained jointly, guided by the previously trained reward model. Updates are computed at each token position within a generated sequence. The value function's parameters are adjusted by minimizing the Mean Squared Error (MSE) between its predictions and the observed returns, while the policy is refined by minimizing the Proximal Policy Optimization (PPO) clipped surrogate loss, which encourages the generation of outputs that receive higher rewards while keeping each update close to the previous policy.
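The two per-token losses described above can be sketched as follows. This is a minimal illustration, not the book's reference implementation: the function name and toy inputs are hypothetical, and it assumes per-token log-probabilities, advantages, and returns have already been computed from rollouts scored by the reward model.

```python
import numpy as np

def ppo_value_losses(logp_new, logp_old, advantages, values, returns, clip_eps=0.2):
    """Per-token PPO clipped surrogate loss and value-function MSE.

    All inputs are 1-D arrays indexed by token position in a generated
    sequence. `clip_eps` is the standard PPO clipping range.
    """
    # Probability ratio pi_new(a_t|s_t) / pi_old(a_t|s_t) at each token.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the clipped objective; we return its negation as a loss.
    policy_loss = -np.mean(np.minimum(unclipped, clipped))
    # Value model is fit by MSE against the observed returns.
    value_loss = np.mean((values - returns) ** 2)
    return policy_loss, value_loss
```

In practice the two losses are combined (often with an entropy bonus) into a single objective and minimized with a gradient-based optimizer over both models' parameters.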


Updated 2026-05-02


Tags

Ch.4 Alignment - Foundations of Large Language Models
