Objective Function for Policy Learning in RLHF
The objective in the policy learning phase of Reinforcement Learning from Human Feedback (RLHF) is to find the optimal policy parameters, denoted as $\theta^*$, that maximize the expected reward. The optimization starts from the parameters of a pre-trained model, $\theta_0$, and seeks to maximize the reward assigned by a learned reward model, $r_\phi$. The formal expression is:

$$\theta^* = \arg\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\big[\, r_{\phi}(x, y) \,\big]$$

Here:
- $\theta^*$ are the optimized policy parameters.
- $\arg\max_{\theta}$ indicates that we are searching for the parameters $\theta$ that maximize the objective, with the search initialized at the pre-trained parameters $\theta_0$.
- $\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}$ represents the expected value over the dataset $\mathcal{D}$: for each input $x$ drawn from the dataset, a response $y$ is generated by the current policy $\pi_{\theta}$.
- $r_{\phi}(x, y)$ is the score assigned by the reward model (with parameters $\phi$) to the generated response $y$ for the given input $x$.
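
To make the objective concrete, below is a minimal sketch of how it can be optimized with a REINFORCE-style gradient estimate. The toy policy (a logit table over a handful of candidate responses), the random "reward model" score table, and the prompt/response counts are all illustrative assumptions; a real RLHF pipeline would use a language-model policy, a learned reward model, and an algorithm such as PPO.

```python
# Minimal sketch (illustrative assumptions, not the course's implementation):
#   theta* = argmax_theta  E_{x ~ D, y ~ pi_theta(.|x)} [ r_phi(x, y) ]
# estimated with REINFORCE on a toy problem.

import torch

torch.manual_seed(0)

NUM_PROMPTS = 4      # toy dataset D: 4 prompts
NUM_RESPONSES = 5    # toy action space: 5 candidate responses per prompt

# Toy policy pi_theta: one logit per (prompt, response) pair.
theta = torch.zeros(NUM_PROMPTS, NUM_RESPONSES, requires_grad=True)

# Frozen "reward model" r_phi: here just a fixed random score table.
r_phi = torch.randn(NUM_PROMPTS, NUM_RESPONSES)

optimizer = torch.optim.Adam([theta], lr=0.1)

for step in range(200):
    # Sample x ~ D uniformly and y ~ pi_theta(. | x).
    x = torch.randint(0, NUM_PROMPTS, (32,))
    probs = torch.softmax(theta[x], dim=-1)
    y = torch.multinomial(probs, num_samples=1).squeeze(-1)

    # Score each sampled response with the reward model.
    reward = r_phi[x, y]

    # REINFORCE estimator: grad E[r] ~= E[ r * grad log pi_theta(y|x) ],
    # so minimizing the negative reward-weighted log-probability
    # performs gradient ascent on the expected reward.
    log_prob = torch.log(probs[torch.arange(len(x)), y])
    loss = -(reward * log_prob).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained policy should concentrate on the highest-reward response
# for each prompt.
print(torch.softmax(theta, dim=-1).argmax(dim=-1))
print(r_phi.argmax(dim=-1))
```

The point the sketch illustrates is that the reward $r_{\phi}(x, y)$ weights the log-probability of each sampled response, so gradient ascent pushes probability mass toward high-reward responses; note that this bare objective has no KL-divergence penalty toward the reference model, which is what the follow-up cards on reward hacking and the reference policy address.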

References
Reference of Foundations of Large Language Models Course
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models
Related
Objective Function for Policy Learning in RLHF
Use of Proximal Policy Optimization (PPO) in RLHF
Application of A2C in RLHF for LLM Alignment
Role and Definition of the Reference Model in RLHF
Joint Optimization of Policy and Value Functions in RLHF
RLHF Policy Optimization Objective
Reference Policy in RLHF
RLHF Policy Optimization as Loss Minimization
A language model is being fine-tuned using an iterative feedback process. In each step, the model generates a response to a prompt. A separate, pre-trained scoring model then assigns a numerical score to this response based on its quality. What is the most direct and immediate use of this numerical score within a single step of this training loop?
Arrange the following events into the correct chronological order as they would occur within a single iterative step of the policy learning phase for a language model.
Diagnosing a Training Failure in an Iterative Fine-Tuning Process
Direct Preference Optimization (DPO)
Objective Function for Policy Learning in RLHF
A language model generates a response that is evaluated by breaking it into four distinct segments. A reward function assigns a score to each segment based on its quality. The scores for the segments are: Segment 1: +1.2, Segment 2: -0.5, Segment 3: +0.8, and Segment 4: -0.2. If the total reward for the entire response is calculated by summing the rewards of its individual segments, what is the total reward?
A language model generates a three-paragraph summary of a research paper. The first paragraph accurately introduces the paper's objective. The second paragraph correctly describes the methodology but contains a significant factual error about the main finding. The third paragraph draws a logical, but ultimately incorrect, conclusion based on the error in the second paragraph. If the total quality score for the summary is calculated as the sum of scores from each paragraph (segment), which segment is most likely to receive the lowest score?
Debugging a Recipe-Generating Language Model
Learn After
KL-Divergence Penalty in RLHF Policy Optimization
A team is fine-tuning a language model where the only goal is to adjust the model's parameters to maximize the average score from a fixed reward model. After many training iterations, the team observes that while the policy consistently achieves high reward scores, the generated text is becoming repetitive and stylistically unnatural. What is the most likely reason for this outcome, based on the optimization objective?
Diagnosing Undesirable Model Behavior
Match each mathematical component from the policy learning objective function with its conceptual role in the training process.