1Cademy - During a reinforcement learning update step for a language model, a generated response (action) is evaluated and receives a significantly positive advantage score. Based on the structure of the policy gradient objective function, which aims to maximize the sum of log probabilities multiplied by advantages, what is the most direct consequence for the model's parameters?

Learn Before

Policy Gradient Objective Function for RL Fine-Tuning

Multiple Choice

During a reinforcement learning update step for a language model, a generated response (action) is evaluated and receives a significantly positive advantage score. Based on the structure of the policy gradient objective function, which aims to maximize the sum of log probabilities multiplied by advantages, what is the most direct consequence for the model's parameters?
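
To make the structure of the objective concrete, here is a minimal sketch of one gradient-ascent step on the advantage-weighted log probability. It assumes a toy softmax policy over three actions standing in for the language model; the names `softmax`, `policy_gradient_step`, the learning rate, and the advantage value are all hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, action, advantage, lr=0.1):
    # One gradient-ascent step on: advantage * log pi(action).
    # For a softmax policy, d log pi(action) / d logit_k = 1[k == action] - pi(k).
    probs = softmax(logits)
    grad = [advantage * ((1.0 if k == action else 0.0) - p)
            for k, p in enumerate(probs)]
    return [x + lr * g for x, g in zip(logits, grad)]

# Toy parameters: three equally likely actions (assumed starting point).
logits = [0.0, 0.0, 0.0]
before = softmax(logits)[1]

# The sampled action (index 1) receives a positive advantage.
updated = policy_gradient_step(logits, action=1, advantage=2.0)
after = softmax(updated)[1]

print(before, after)
```

Under these assumptions, comparing `before` and `after` shows how the sign of the advantage determines the direction in which the parameters move the log probability of the sampled action.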

Updated 2025-10-07
