Learn Before
An AI development team is in a phase of training where their goal is to make a language model's responses more aligned with human preferences. They use an optimization process that aims to minimize a loss function, L, which takes an input prompt x, a set of model-generated responses {y1, y2, ...}, and a component r as inputs. How does this loss function L primarily guide the model's policy towards generating better responses?
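One way to see how such a loss guides the policy is a minimal numeric sketch of a pairwise preference loss of the form -log σ(s1 - s2), where s1 and s2 stand for the policy's scores for the response the reward component r prefers and the one it does not. The function name `preference_loss` and the scalar scores are illustrative assumptions, not the book's exact formulation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_loss(score_preferred, score_rejected):
    # Pairwise loss -log sigma(s1 - s2): it shrinks as the policy
    # assigns a higher score to the reward-preferred response, so
    # minimizing it shifts probability mass toward better responses.
    return -math.log(sigmoid(score_preferred - score_rejected))
```

When the policy scores both responses equally, the loss is log 2; raising the preferred response's score drives it toward zero, which is the sense in which minimizing L steers generation toward responses the reward component ranks higher.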
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
During the policy optimization phase where the objective is to minimize the loss function L(x, {y1, y2}, r), the parameters of the reward model r are updated simultaneously with the language model's policy to better reflect human preferences for the given prompt x.
Diagnosing Undesirable Behavior in Policy Optimization