Learn Before
RLHF Policy Optimization Objective
The goal of the policy training stage in Reinforcement Learning from Human Feedback (RLHF) is to find the optimal policy parameters $\theta^*$ that maximize expected reward without deviating too far from a reference policy. The training objective evaluates the quality of an output $y$ given an input $x$ using a reward model $R(x, y)$. The objective minimizes the negative reward (loss) and includes a penalty for policy divergence:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}\left[ R(x, y) - \beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right]$$

Here, the penalty $\log \frac{\pi_{\theta}(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ regularizes the current policy $\pi_{\theta}$ against the reference policy $\pi_{\text{ref}}$ using a coefficient $\beta$.
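As a concrete illustration (not from the course material), the sketch below computes this loss for a batch of sampled outputs in PyTorch, using the sequence-level log-probability ratio as the divergence penalty. The function name `rlhf_policy_loss`, the tensor shapes, and the toy values are illustrative assumptions:

```python
import torch

def rlhf_policy_loss(reward, logp_policy, logp_ref, beta=0.1):
    """Per-sequence RLHF loss: negative reward plus a KL-style penalty.

    reward:      reward-model scores R(x, y), shape (batch,)
    logp_policy: log pi_theta(y | x) under the current policy, shape (batch,)
    logp_ref:    log pi_ref(y | x) under the frozen reference model, shape (batch,)
    beta:        coefficient controlling the strength of the divergence penalty
    """
    # log-ratio estimate of the divergence between pi_theta and pi_ref
    # on the sampled outputs
    kl_penalty = logp_policy - logp_ref
    # minimizing -(reward - beta * penalty) maximizes the penalized reward
    return -(reward - beta * kl_penalty).mean()

# Toy usage with random values standing in for model outputs (hypothetical).
reward = torch.tensor([1.2, 0.4, 0.9])
logp_policy = torch.tensor([-32.0, -41.5, -28.3], requires_grad=True)
logp_ref = torch.tensor([-33.1, -40.8, -30.0])

loss = rlhf_policy_loss(reward, logp_policy, logp_ref)
loss.backward()  # gradients flow only into the current policy's log-probs
print(loss.item())
```

In practice the penalty is often accumulated per token rather than per sequence, and the reference model's log-probabilities are computed with gradients disabled so that only the current policy is updated.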

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Objective Function for Policy Learning in RLHF
Use of Proximal Policy Optimization (PPO) in RLHF
Application of A2C in RLHF for LLM Alignment
Role and Definition of the Reference Model in RLHF
Joint Optimization of Policy and Value Functions in RLHF
RLHF Policy Optimization Objective
Reference Policy in RLHF
RLHF Policy Optimization as Loss Minimization
A language model is being fine-tuned using an iterative feedback process. In each step, the model generates a response to a prompt. A separate, pre-trained scoring model then assigns a numerical score to this response based on its quality. What is the most direct and immediate use of this numerical score within a single step of this training loop?
Arrange the following events into the correct chronological order as they would occur within a single iterative step of the policy learning phase for a language model.
Diagnosing a Training Failure in an Iterative Fine-Tuning Process
Direct Preference Optimization (DPO)
Policy Divergence Penalty for Language Models
KL-Divergence Penalty in RLHF Policy Optimization
An AI development team is fine-tuning a language model using a reinforcement learning process guided by a reward model. They observe that the model's outputs, while receiving high scores from the reward model, are becoming stylistically unnatural and deviating significantly from the helpful tone established during its initial supervised training. Which of the following adjustments to the training process is most specifically designed to counteract this behavioral drift?
Diagnosing and Mitigating Reward Hacking
Consequences of Omitting a Reference Policy in RLHF
Learn After
PPO Objective for LLM Training
Derivation of the KL Divergence Objective for Policy Optimization
During the policy optimization stage of training a large language model, an engineer observes that the model's outputs are coherent and safe, but they show very little improvement over the initial supervised fine-tuned version and consistently receive mediocre scores from the reward model. Which of the following is the most likely cause of this issue, based on the policy optimization objective function that balances maximizing rewards with a penalty for policy divergence?
Analyzing the Trade-off in Policy Optimization
Analyzing a Modified Policy Optimization Objective