Analyzing a Policy Update Step
A language model is being fine-tuned with reinforcement learning. During one update step, the model evaluates two different generated continuations of the same prompt. Continuation A receives an advantage score of +2.0, while Continuation B receives an advantage score of -1.5. Based on the objective function, which aims to maximize the sum of log probabilities of actions weighted by their advantages, describe the effect of this single update step on the model's tendency to generate each of these continuations in the future. Explain your reasoning for both continuations.
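The effect can be verified numerically. The sketch below is a minimal illustration, not the full fine-tuning setup: it assumes a toy policy with one logit per continuation (standing in for the model's actual parameters), a hypothetical learning rate of 0.1, and the two advantage scores from the question. One gradient-ascent step on the objective J = Σᵢ Aᵢ · log π(aᵢ) uses the analytic softmax gradient ∂J/∂z_k = A_k − p_k · Σᵢ Aᵢ.

```python
import math

def softmax(zs):
    """Convert logits to probabilities."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

# Toy setup (hypothetical): one logit per continuation, both equally
# likely before the update; advantages are taken from the question.
logits = [0.0, 0.0]        # [Continuation A, Continuation B]
advantages = [2.0, -1.5]   # [advantage of A, advantage of B]
lr = 0.1                   # hypothetical learning rate

p_before = softmax(logits)

# One gradient-ascent step on J = sum_i A_i * log pi(a_i).
# For a softmax policy: dJ/dz_k = A_k - p_k * sum_i A_i.
total_adv = sum(advantages)
grads = [a - p * total_adv for a, p in zip(advantages, p_before)]
logits = [z + lr * g for z, g in zip(logits, grads)]

p_after = softmax(logits)
print("before:", p_before)  # A and B start at 0.5 each
print("after: ", p_after)   # A's probability rises, B's falls
```

After the step, the probability of Continuation A (positive advantage) increases and that of Continuation B (negative advantage) decreases, which is exactly the qualitative behavior the question asks you to explain.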
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
During a reinforcement learning update step for a language model, a generated response (action) is evaluated and receives a significantly positive advantage score. Based on the structure of the policy gradient objective function, which aims to maximize the sum of log probabilities multiplied by advantages, what is the most direct consequence for the model's parameters?
Analyzing a Policy Update Step
Analyzing Conflicting Signals in RL Fine-Tuning