During a reinforcement learning update step for a language model, a generated response (action) is evaluated and receives a significantly positive advantage score. Based on the structure of the policy gradient objective function, which aims to maximize the sum of log probabilities multiplied by advantages, what is the most direct consequence for the model's parameters?
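To reason about the question, it can help to look at one gradient-ascent step on the per-sample objective A * log pi(a) for a toy policy. The sketch below is an illustration under stated assumptions, not the course's implementation: it assumes a softmax policy over three actions parameterized directly by its logits, and the names `softmax`, `policy_gradient_step`, and `lr` are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of logits.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_step(logits, action, advantage, lr=0.1):
    # For a softmax policy, d/d(logits) of log pi(action) is
    # one_hot(action) - probs. The objective term is
    # advantage * log pi(action), so its gradient is scaled by the advantage.
    probs = softmax(logits)
    one_hot = np.zeros_like(logits)
    one_hot[action] = 1.0
    grad = advantage * (one_hot - probs)
    # Gradient ascent: move the parameters in the direction of the gradient.
    return logits + lr * grad

# Toy setup (assumed): three equally likely actions, action 1 was sampled.
logits = np.array([0.0, 0.0, 0.0])
before = softmax(logits)[1]
after_pos = softmax(policy_gradient_step(logits, action=1, advantage=2.0))[1]
after_neg = softmax(policy_gradient_step(logits, action=1, advantage=-2.0))[1]

print(before, after_pos, after_neg)
```

Comparing `before` with `after_pos` and `after_neg` shows how the sign of the advantage determines the direction in which the probability of the sampled action moves in this toy setting.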
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analyzing a Policy Update Step
Analyzing Conflicting Signals in RL Fine-Tuning