Analyzing Conflicting Signals in RL Fine-Tuning
During reinforcement learning fine-tuning of a language model, a specific action a_t (generating a token) is taken in state s_t. This action has a very high probability under the current policy, π_θ(a_t|s_t), yet its advantage A(s_t, a_t) is a large negative value. Considering only this single term of the policy gradient objective, describe the resulting effect on the model's parameters during the update step and explain the reasoning behind it.
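One way to explore the question numerically is the minimal sketch below. It assumes a toy softmax policy over three tokens whose logits stand in for the model's parameters, and it takes a single gradient-ascent step on the one term A(s_t, a_t) · log π_θ(a_t|s_t). The logit values, the advantage of -5.0, the learning rate, and the helper names (softmax, pg_term_grad, ascent_step) are all hypothetical choices made for illustration, not part of the original question. For a softmax policy, the gradient of log π(a) with respect to logit i is (1 if i == a else 0) - π(i), so a negative advantage pushes the chosen token's logit down and the other logits up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pg_term_grad(logits, action, advantage):
    """Gradient of advantage * log pi(action) with respect to each logit.

    For a softmax policy, d log pi(a) / d z_i = 1[i == a] - pi(i).
    """
    probs = softmax(logits)
    return [
        advantage * ((1.0 if i == action else 0.0) - p)
        for i, p in enumerate(probs)
    ]


def ascent_step(logits, action, advantage, lr=0.1):
    """One gradient-ascent step on the single policy gradient term."""
    grad = pg_term_grad(logits, action, advantage)
    return [z + lr * g for z, g in zip(logits, grad)]


# Hypothetical toy setup: token 0 is highly probable under the current policy.
logits = [4.0, 1.0, 0.5]
action = 0          # a_t, the token that was generated
advantage = -5.0    # large negative A(s_t, a_t), assumed for illustration

before = softmax(logits)[action]
after = softmax(ascent_step(logits, action, advantage))[action]

print("gradient w.r.t. logits:", pg_term_grad(logits, action, advantage))
print(f"pi(a_t|s_t) before: {before:.4f}, after: {after:.4f}")
```

In this sketch the chosen token's probability drops after the step even though it started high. Because the gradient on its logit carries the factor (1 - π(a_t)), a very confident policy moves that logit by a comparatively small amount per step for a given advantage magnitude.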
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
During a reinforcement learning update step for a language model, a generated response (action) is evaluated and receives a significantly positive advantage score. Based on the structure of the policy gradient objective function, which aims to maximize the sum of log probabilities multiplied by advantages, what is the most direct consequence for the model's parameters?
Analyzing a Policy Update Step