Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
PPO Clipped Surrogate Objective in RLHF
Advantage Function Estimation in RLHF
PPO Objective Formula for LLM Training in RLHF
Diagnosing Training Instability in Language Model Fine-Tuning
A team is fine-tuning a language model using a reinforcement learning process. In each step, the model generates a response to a prompt, a separate reward model scores the response, and the language model's parameters are updated based on this score. The team finds that a simple update rule, which aggressively maximizes the immediate reward, often leads to 'policy collapse'—the model's linguistic quality degrades, and it starts generating repetitive, nonsensical text that happens to exploit the reward model. What is the primary reason for employing an algorithm like Proximal Policy Optimization (PPO) in this scenario?
When fine-tuning a language model with a reward signal, an optimization method like Proximal Policy Optimization (PPO) is used precisely because it does not maximize the reward score above all else. PPO constrains how far the policy can move in each training step, typically by clipping the probability ratio between the new and old policies (and often adding a KL penalty against a reference model), so the model improves its reward gradually while staying close to its current behavior. This bounded update prevents the policy collapse and reward-model exploitation that an unconstrained, reward-greedy update rule produces.
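A minimal sketch of the clipped surrogate loss that enforces this bounded update, assuming per-token log-probabilities and advantage estimates are already computed; the function and variable names (ppo_clipped_loss, log_probs_new, clip_eps) are illustrative, not taken from the book:

```python
import torch

def ppo_clipped_loss(log_probs_new: torch.Tensor,
                     log_probs_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio between the updated policy and the policy that
    # generated the sampled responses.
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Unclipped and clipped surrogate objectives.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the element-wise minimum removes any incentive to push the
    # ratio far from 1, so a single step cannot change the policy drastically.
    return -torch.min(unclipped, clipped).mean()

# Illustrative token-level values (assumed, not real training data).
new_lp = torch.tensor([-1.0, -0.5, -2.0])
old_lp = torch.tensor([-1.2, -0.6, -1.8])
adv = torch.tensor([0.5, -0.3, 1.2])
loss = ppo_clipped_loss(new_lp, old_lp, adv)
```

Because the clipped term caps the gain from moving the ratio outside [1 - clip_eps, 1 + clip_eps], even a response that the reward model scores very highly can only shift the policy a limited amount per step, which is what guards against the collapse described in the question.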