Policy Gradient Objective Function for RL Fine-Tuning
The objective function for the reinforcement learning fine-tuning phase of RLHF is based on the policy gradient method. The goal is to update the language model's policy parameters, $\theta$, to maximize the expected advantage of its actions. For a given trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$, the objective function is defined as:

$$J(\theta) = \sum_{t} \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)$$

Here, $\pi_\theta(a_t \mid s_t)$ is the probability of the policy taking action $a_t$ in state $s_t$, and $A(s_t, a_t)$ is the advantage function, which measures how much better that action is compared to the average action in that state.
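To make the update concrete, here is a minimal PyTorch sketch of one policy-gradient step on this objective. It is an illustration under stated assumptions, not the full RLHF pipeline: the small logits table standing in for the language model, the trajectory length, and the hard-coded advantage values are all hypothetical.

```python
import torch

# Toy policy-gradient step. The "policy" is a bare logits table rather
# than a language model, and the advantage values are made up; only the
# construction of the objective mirrors the formula above.
torch.manual_seed(0)

vocab_size = 8   # toy action space (e.g., next-token choices)
seq_len = 5      # length of the sampled trajectory tau

# Stand-in policy parameters: one row of logits per step of the trajectory.
logits = torch.randn(seq_len, vocab_size, requires_grad=True)

# Sample actions a_t from the current policy pi_theta(. | s_t).
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()

# Hypothetical advantage estimates A(s_t, a_t), e.g., a reward-model score
# minus a baseline. Treated as constants with respect to theta.
advantages = torch.tensor([0.5, -0.2, 1.3, 0.1, -0.7])

# J(theta) = sum_t log pi_theta(a_t | s_t) * A(s_t, a_t)
log_probs = dist.log_prob(actions)
objective = (log_probs * advantages).sum()

# Maximize J by gradient ascent on theta.
objective.backward()
with torch.no_grad():
    logits += 0.1 * logits.grad  # one ascent step on the toy parameters
```

Note the direction of the update: an action with a positive advantage has its log-probability (and hence its probability) pushed up, while a negative advantage pushes it down.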
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Continuous Supervision from the RLHF Reward Model
A language model is being aligned using feedback from human preferences. A separate model is first trained to distinguish between pairs of model-generated responses, learning to identify the better one in each pair. This model is then used to assign a single numerical value to each new response generated by the language model, guiding its optimization. What is the most significant advantage of this two-stage process?
During the reinforcement learning phase of model alignment, the reward model's primary function is to output a binary classification for each generated response, labeling it as either 'preferred' or 'not preferred'.
The Reward Model's Functional Shift
Policy Gradient Objective Function for RL Fine-Tuning
Learn After
During a reinforcement learning update step for a language model, a generated response (action) is evaluated and receives a significantly positive advantage score. Based on the structure of the policy gradient objective function, which aims to maximize the sum of log probabilities multiplied by advantages, what is the most direct consequence for the model's parameters?
Analyzing a Policy Update Step
Analyzing Conflicting Signals in RL Fine-Tuning