Diagnosing Undesirable Behavior in Policy Optimization
An engineering team is optimizing a language model's policy by minimizing a loss function, L(x, {y1, y2}, r), where r is a pre-trained reward model that scores responses according to human preferences. The team observes that while the loss L is consistently decreasing, the language model is increasingly producing overly cautious and generic responses, often refusing to answer harmless questions. Based on the structure of the optimization objective, which component is the most likely source of this undesirable behavior, and why?
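A minimal sketch may make the failure mode concrete. The card does not give the exact form of L, so the code below assumes a pairwise ranking loss in which the frozen reward model r picks the preferred of the two sampled responses; the function and variable names (pairwise_policy_loss, policy_logprobs, reward_scores) are illustrative, not taken from the source.

```python
import torch
import torch.nn.functional as F

# Assumed sketch of L(x, {y1, y2}, r): a pairwise ranking loss in which a
# FROZEN pre-trained reward model r scores two sampled responses and the
# policy is pushed toward the one r prefers. No gradient ever flows through
# r, so any bias in r is inherited by the policy as L decreases.

def pairwise_policy_loss(policy_logprobs: torch.Tensor,
                         reward_scores: torch.Tensor) -> torch.Tensor:
    """policy_logprobs: (batch, 2) log pi(y_i | x) for responses {y1, y2}.
    reward_scores:     (batch, 2) frozen scores r(x, y_i)."""
    # The reward model, not a human, decides which response is "preferred".
    preferred = reward_scores.argmax(dim=-1)            # (batch,)
    rejected = 1 - preferred
    logp_pref = policy_logprobs.gather(-1, preferred.unsqueeze(-1)).squeeze(-1)
    logp_rej = policy_logprobs.gather(-1, rejected.unsqueeze(-1)).squeeze(-1)
    # Minimizing this raises pi(preferred) relative to pi(rejected).
    return -F.logsigmoid(logp_pref - logp_rej).mean()

# Toy example: suppose y1 is a generic refusal that r systematically
# over-scores, while y2 is a helpful answer to a harmless question.
policy_logprobs = torch.tensor([[-12.0, -9.0]], requires_grad=True)
reward_scores = torch.tensor([[2.1, 0.3]])  # r prefers the refusal y1

loss = pairwise_policy_loss(policy_logprobs, reward_scores)
loss.backward()
print(f"loss={loss.item():.4f}")
print(policy_logprobs.grad)  # negative grad on y1: mass shifts toward the refusal
```

Under this assumed objective, a falling loss only certifies agreement with r, not with the underlying human preferences: if r over-rewards cautious refusals, the policy drifts toward refusing harmless prompts even as L steadily decreases.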
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An AI development team is in a phase of training where their goal is to make a language model's responses more aligned with human preferences. They use an optimization process that aims to minimize a loss function, L, which takes an input prompt x, a set of model-generated responses {y1, y2, ...}, and a component r as inputs. How does this loss function L primarily guide the model's policy towards generating better responses?

During the policy optimization phase where the objective is to minimize the loss function L(x, {y1, y2}, r), the parameters of the reward model r are updated simultaneously with the language model's policy to better reflect human preferences for the given prompt x.