
Joint Optimization of Policy and Value Functions in RLHF

In the final stage of the RLHF process, the policy and value models are trained jointly, guided by the previously trained reward model. Updates are computed at each token position within a generated sequence. The value function's parameters are adjusted by minimizing the Mean Squared Error (MSE) between its predictions and the observed returns, while the policy is refined by minimizing the Proximal Policy Optimization (PPO) clipped surrogate loss, which encourages the generation of outputs that receive higher rewards while keeping each update close to the previous policy.
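The two per-token losses described above can be sketched as follows. This is a minimal illustration, not the book's reference implementation: the function name and toy inputs are hypothetical, and it assumes per-token log-probabilities, advantages, and returns have already been computed from rollouts scored by the reward model.

```python
import numpy as np

def ppo_value_losses(logp_new, logp_old, advantages, values, returns, clip_eps=0.2):
    """Per-token PPO clipped surrogate loss and value-function MSE.

    All inputs are 1-D arrays indexed by token position in a generated
    sequence. `clip_eps` is the standard PPO clipping range.
    """
    # Probability ratio pi_new(a_t|s_t) / pi_old(a_t|s_t) at each token.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the clipped objective; we return its negation as a loss.
    policy_loss = -np.mean(np.minimum(unclipped, clipped))
    # Value model is fit by MSE against the observed returns.
    value_loss = np.mean((values - returns) ** 2)
    return policy_loss, value_loss
```

In practice the two losses are combined (often with an entropy bonus) into a single objective and minimized with a gradient-based optimizer over both models' parameters.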


Updated 2026-05-02


Tags

Ch.4 Alignment - Foundations of Large Language Models
