Learn Before
A team is fine-tuning a language model where the only goal is to adjust the model's parameters to maximize the average score from a fixed reward model. After many training iterations, the team observes that while the policy consistently achieves high reward scores, the generated text is becoming repetitive and stylistically unnatural. What is the most likely reason for this outcome, based on the optimization objective?
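The scenario above describes reward over-optimization: with no term anchoring the policy to its reference (pre-trained) distribution, the policy is free to collapse onto repetitive, high-reward outputs. A minimal sketch of the two objectives, using hypothetical names (`rlhf_objective`, `logp_policy`, `logp_ref`, `beta`) chosen for illustration, assuming a simple per-token Monte Carlo estimate of the KL term:

```python
def rlhf_objective(reward, logp_policy, logp_ref, beta=0.1):
    """Per-sequence RLHF objective: reward minus a KL penalty that keeps
    the policy close to the reference model.

    logp_policy / logp_ref: log-probabilities the policy and the frozen
    reference model assign to the sampled tokens of one response.
    beta: strength of the KL penalty (beta = 0 recovers reward-only training).
    """
    # Monte Carlo estimate of KL(policy || reference) on the sampled tokens.
    kl_estimate = sum(p - r for p, r in zip(logp_policy, logp_ref))
    return reward - beta * kl_estimate

# Reward-only objective (beta = 0): nothing discourages drifting far from
# the reference distribution, so degenerate repetitive text that fools the
# fixed reward model scores just as well as natural text.
reward_only = rlhf_objective(reward=4.2,
                             logp_policy=[-0.1, -0.2],
                             logp_ref=[-2.0, -2.1],
                             beta=0.0)

# KL-penalized objective: the same drift away from the reference model now
# costs reward, pulling the optimum back toward fluent, natural text.
penalized = rlhf_objective(reward=4.2,
                           logp_policy=[-0.1, -0.2],
                           logp_ref=[-2.0, -2.1],
                           beta=0.1)
```

With the penalty active, the same high-reward but off-distribution sample scores lower (here 3.82 instead of 4.2), which is exactly the regularizing pressure the reward-only objective lacks.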
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
KL-Divergence Penalty in RLHF Policy Optimization
Diagnosing Undesirable Model Behavior
Match each mathematical component from the policy learning objective function with its conceptual role in the training process.