Activity (Process)

High-Level Process of RLHF with PPO

Reinforcement Learning from Human Feedback (RLHF) implemented with Proximal Policy Optimization (PPO) proceeds in a sequence of stages. First, human preference data is collected, recording that one response is preferred over another (y_a ≻ y_b). This data is used to train a reward model that scores candidate responses. During reinforcement learning, the reward model supplies the scalar reward signal, and a value function (critic) is trained to estimate the expected reward under the current policy. In the final stage, PPO optimizes the policy against this reward, starting from a policy initialized through Maximum Likelihood Estimation (MLE), i.e., standard supervised training. A minimal sketch of these components follows below.
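The sketch below illustrates the two learned components this pipeline adds on top of the MLE-initialized policy: a reward model fit to preference pairs with a Bradley-Terry style loss, and a single PPO update combining the clipped surrogate objective, a value-function (critic) loss, and a KL penalty toward the frozen reference policy. It is a minimal, illustrative assumption-laden example in PyTorch with toy tensors standing in for a real tokenized language model; every module name, shape, and hyperparameter is chosen for illustration and is not specified in the text.

```python
# Minimal, illustrative sketch of RLHF-with-PPO components (not a full implementation).
# All modules, shapes, and hyperparameters are assumptions; toy tensors replace a real LM.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden, vocab = 16, 8

# --- Stage 1: reward model trained on preference pairs (y_a preferred over y_b) ---
reward_model = torch.nn.Linear(hidden, 1)        # maps a response representation to a scalar reward
rm_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

feat_a = torch.randn(32, hidden)                 # stand-in features of preferred responses y_a
feat_b = torch.randn(32, hidden)                 # stand-in features of rejected responses y_b
r_a, r_b = reward_model(feat_a), reward_model(feat_b)
rm_loss = -F.logsigmoid(r_a - r_b).mean()        # Bradley-Terry style loss: push r(y_a) above r(y_b)
rm_loss.backward()
rm_opt.step()

# --- Stage 2: PPO update of the policy, with a value function (critic) and a frozen
# reference policy taken from the MLE-initialized model ---
policy = torch.nn.Linear(hidden, vocab)          # actor: per-token logits
value_fn = torch.nn.Linear(hidden, 1)            # critic: estimates expected reward
ref_policy = torch.nn.Linear(hidden, vocab)      # frozen MLE-initialized reference
ref_policy.load_state_dict(policy.state_dict())
ppo_opt = torch.optim.Adam(list(policy.parameters()) + list(value_fn.parameters()), lr=1e-4)

states = torch.randn(32, hidden)                 # stand-in states (prompt plus partial response)
actions = torch.randint(0, vocab, (32,))         # tokens sampled under the old policy
old_logp = torch.log_softmax(ref_policy(states), -1).gather(1, actions[:, None]).squeeze(1).detach()
rewards = reward_model(states).squeeze(1).detach()   # scalar rewards from the frozen reward model

values = value_fn(states).squeeze(1)
advantages = (rewards - values).detach()         # one-step advantage estimate (no GAE, for brevity)

new_logp = torch.log_softmax(policy(states), -1).gather(1, actions[:, None]).squeeze(1)
ratio = torch.exp(new_logp - old_logp)
clip_eps, kl_coef = 0.2, 0.1
# Clipped surrogate objective plus an approximate KL penalty toward the reference policy
policy_loss = -torch.min(ratio * advantages,
                         torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages).mean()
kl_penalty = kl_coef * (new_logp - old_logp).mean()
value_loss = F.mse_loss(values, rewards)

(policy_loss + kl_penalty + value_loss).backward()
ppo_opt.step()
```

In practice the reward model is kept fixed during PPO, the KL penalty (or a KL term folded into the reward) keeps the policy close to the MLE-initialized reference, and the advantage estimate would typically use GAE over full generated responses rather than the one-step difference shown here.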

Tags: Ch.4 Alignment - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences