RLHF Training Process with PPO
The Reinforcement Learning from Human Feedback (RLHF) process using Proximal Policy Optimization (PPO) unfolds in several stages. First, human preference data is collected and used to train a reward model. Once the reward model is trained, the active training phase begins for both the target policy and the value function, with a frozen copy of the initial model serving as the reference model. At each training step, the policy's parameters are updated by minimizing the PPO loss, which is computed from the reward model's score, a KL penalty against the reference model, and advantages estimated from the current value function. Simultaneously, the value function is refined by minimizing a Mean Squared Error (MSE) loss between its predicted returns and the observed returns.
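As a rough illustration of this update, the PyTorch-style sketch below computes, for a single generated sequence, a PPO clipped policy loss with a KL penalty toward the reference model and the MSE loss for the value function. The argument names (e.g., `policy_logprobs`, `sequence_reward`) are hypothetical placeholders for quantities produced elsewhere in a real training loop; this is a simplified single-sequence sketch, not a complete implementation.

```python
import torch
import torch.nn.functional as F

def ppo_rlhf_losses(
    policy_logprobs,   # (T,) log-probs of sampled tokens under the current policy
    old_logprobs,      # (T,) log-probs under the policy that generated the rollout
    ref_logprobs,      # (T,) log-probs under the frozen reference model
    values,            # (T,) value-function predictions per token
    sequence_reward,   # scalar reward-model score for the full response
    kl_coef=0.1,       # weight of the KL penalty toward the reference model
    clip_eps=0.2,      # PPO clipping range
    gamma=1.0,
    lam=0.95,
):
    """Sketch of the per-sequence PPO policy loss and value-function MSE loss."""
    T = policy_logprobs.shape[0]
    values_det = values.detach()

    # Per-token reward: KL penalty toward the reference model at every token,
    # with the reward-model score added at the final token of the sequence.
    rewards = -kl_coef * (policy_logprobs.detach() - ref_logprobs)
    rewards[-1] = rewards[-1] + sequence_reward

    # Generalized Advantage Estimation (GAE) from the (detached) value predictions.
    advantages = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_value = values_det[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values_det[t]
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    returns = advantages + values_det

    # PPO clipped surrogate objective for the policy parameters.
    ratio = torch.exp(policy_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value function is refined by minimizing MSE against the observed returns.
    value_loss = F.mse_loss(values, returns)

    return policy_loss, value_loss
```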
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Historical Development of RLHF
Policy Learning in RLHF
Justification for Using RLHF over Supervised Learning
Bridging Language Modeling and Reinforcement Learning Notations in RLHF
Architectural Components of an RLHF System
Three-Stage Training Process of RLHF
Refinements and Alternatives to RLHF
Rationale for End-of-Sequence Rewards in RLHF
High-Level Process of RLHF with PPO
Limitations of Human Feedback in LLM Alignment
Computational and Stability Challenges of RLHF
Goal of RLHF
Origin and Application of RLHF
Dual Learning Tasks of RLHF: Reward and Policy Learning
Four-Stage Process of Reinforcement Learning from Human Feedback (RLHF)
RLHF Training Process with PPO
An AI development team is considering two different methods for training a conversational assistant to be more helpful and aligned with user expectations. Method 1 involves having human experts write a large dataset of ideal, high-quality responses to various prompts, and then training the AI to imitate these examples. Method 2 involves having the AI generate several responses to each prompt, and then asking human experts to simply rank these responses from best to worst. This ranking data is then used to train a separate 'preference model' that provides a reward signal to guide the AI's learning process. Which statement best analyzes the primary advantage of Method 2 over Method 1?
LLM as the Agent in RLHF
Reward Model as an Environment Proxy in RLHF
A team is using human feedback to improve a language model's ability to follow instructions safely and helpfully. Arrange the following high-level stages of this process into the correct chronological order.
RLHF Objective Function
Comparison of Objectives: Supervised Fine-Tuning vs. RLHF
Evaluating a Training Method for a High-Stakes Application
Diagnosing Instability in an RLHF + PPO Training Run
Choosing and Justifying an RLHF Objective Under Competing Product Constraints
Interpreting Conflicting RLHF Signals: Reward Model Ranking vs. PPO Updates Under KL Regularization
Root-Cause Analysis of a “Reward Hacking” Spike During RLHF with PPO
Tuning an RLHF + PPO Update When Reward Improves but Behavior Regresses
Post-Deployment Drift After RLHF: Diagnosing Reward Model and PPO/KL Interactions
Designing an RLHF Training Blueprint for a Regulated Customer-Support LLM
You’re running an RLHF fine-tuning job for an inte...
You are reviewing an RLHF training run for an inte...
Your team is running RLHF for a customer-facing LL...
Learn After
Diagnosing a Flawed Model Alignment Pipeline
A team is implementing a training pipeline for a large language model using human feedback and a specific reinforcement learning algorithm. The process involves several distinct stages to align the model's outputs with human preferences. Arrange the following key stages of this training pipeline in the correct chronological order.
A research team is fine-tuning a language model using a reinforcement learning pipeline guided by human feedback. They notice that while the policy is being updated, the training process is highly unstable. Upon investigation, they find that the value function's predictions of future rewards are consistently inaccurate. Which of the following is the most direct cause for the value function's failure to learn, given the standard implementation of this training process?