1Cademy - High-Level Process of RLHF with PPO

Learn Before

Reinforcement Learning from Human Feedback (RLHF)

Activity (Process)

High-Level Process of RLHF with PPO

Reinforcement Learning from Human Feedback (RLHF), when implemented with Proximal Policy Optimization (PPO), proceeds through a sequence of stages. First, preference data is collected as pairwise comparisons (e.g., y_a ≻ y_b, meaning output y_a is preferred over output y_b). These comparisons are used to train a reward model. The reward model's scores then serve as the reward signal from which a value function is learned. The final stage is policy training, where PPO optimizes the policy against those rewards; the policy itself may have been initialized through Maximum Likelihood Estimation (MLE).
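The stages above can be illustrated with a minimal sketch of the two scalar quantities at their core. This is not taken from the source: it assumes the reward model is trained with a Bradley-Terry style pairwise loss, that the advantage is a simple one-step difference between reward and value estimate, and that PPO uses its standard clipped surrogate. The function names and the default `eps=0.2` are hypothetical choices made for illustration.

```python
import math


def preference_loss(r_a: float, r_b: float) -> float:
    """Pairwise loss for a preference y_a > y_b (assumed Bradley-Terry form).

    Computes -log(sigmoid(r_a - r_b)), where r_a and r_b are the reward
    model's scores for the preferred and rejected outputs.
    """
    d = r_b - r_a
    # Numerically stable softplus(d) = log(1 + exp(d))
    return max(d, 0.0) + math.log1p(math.exp(-abs(d)))


def simple_advantage(reward: float, value: float) -> float:
    """One-step advantage: reward-model score minus the value estimate.

    A simplification for illustration; practical systems often use
    multi-step estimators instead.
    """
    return reward - value


def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, eps: float = 0.2) -> float:
    """PPO clipped surrogate objective for a single sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)          # pi_new / pi_old
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    # Take the more pessimistic of the unclipped and clipped terms
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    # Reward model scores the preferred output higher, so the loss is small
    print(preference_loss(r_a=2.0, r_b=0.0))
    adv = simple_advantage(reward=1.5, value=0.5)
    print(ppo_clipped_objective(logp_new=-1.0, logp_old=-1.5, advantage=adv))
```

In this sketch the reward model's score feeds `simple_advantage`, and the resulting advantage feeds the PPO objective, mirroring the order of stages described above. The clipping keeps a single update from moving the policy too far from the one that generated the samples.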

Updated 2025-10-10
