Analyzing a Policy Update Step
A language model is being fine-tuned with reinforcement learning. During one update step, the model evaluates two different generated continuations of the same prompt. Continuation A receives an advantage score of +2.0, while Continuation B receives an advantage score of -1.5. Based on the objective function, which aims to maximize the sum of log probabilities of actions weighted by their advantages, describe the effect of this single update step on the model's tendency to generate each of these continuations in the future. Explain your reasoning for both continuations.
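The effect can be verified numerically. The sketch below is a minimal illustration, not the full fine-tuning setup: it assumes a toy policy with one logit per continuation (standing in for the model's actual parameters), a hypothetical learning rate of 0.1, and the two advantage scores from the question. One gradient-ascent step on the objective J = Σᵢ Aᵢ · log π(aᵢ) uses the analytic softmax gradient ∂J/∂z_k = A_k − p_k · Σᵢ Aᵢ.

```python
import math

def softmax(zs):
    """Convert logits to probabilities."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

# Toy setup (hypothetical): one logit per continuation, both equally
# likely before the update; advantages are taken from the question.
logits = [0.0, 0.0]        # [Continuation A, Continuation B]
advantages = [2.0, -1.5]   # [advantage of A, advantage of B]
lr = 0.1                   # hypothetical learning rate

p_before = softmax(logits)

# One gradient-ascent step on J = sum_i A_i * log pi(a_i).
# For a softmax policy: dJ/dz_k = A_k - p_k * sum_i A_i.
total_adv = sum(advantages)
grads = [a - p * total_adv for a, p in zip(advantages, p_before)]
logits = [z + lr * g for z, g in zip(logits, grads)]

p_after = softmax(logits)
print("before:", p_before)  # A and B start at 0.5 each
print("after: ", p_after)   # A's probability rises, B's falls
```

After the step, the probability of Continuation A (positive advantage) increases and that of Continuation B (negative advantage) decreases, which is exactly the qualitative behavior the question asks you to explain.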
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
During a reinforcement learning update step for a language model, a generated response (action) is evaluated and receives a significantly positive advantage score. Based on the structure of the policy gradient objective function, which aims to maximize the sum of log probabilities multiplied by advantages, what is the most direct consequence for the model's parameters?
Analyzing a Policy Update Step
Analyzing Conflicting Signals in RL Fine-Tuning