Multiple Choice

A team is fine-tuning a language model using a reinforcement learning process. In each step, the model generates a response to a prompt, a separate reward model scores the response, and the language model's parameters are updated based on this score. The team finds that a simple update rule, which aggressively maximizes the immediate reward, often leads to 'policy collapse'—the model's linguistic quality degrades, and it starts generating repetitive, nonsensical text that happens to exploit the reward model. What is the primary reason for employing an algorithm like Proximal Policy Optimization (PPO) in this scenario?
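
For a concrete sense of the mechanism the question is probing, the sketch below shows the clipped surrogate objective at the heart of PPO. It is a minimal illustration, not code from any particular RLHF library: the function name and arguments are invented for this example, and it assumes the per-token log-probabilities and reward-model-derived advantage estimates have already been computed.

    import torch

    def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
        # Probability ratio between the updated policy and the policy
        # that actually generated the responses (both given as log-probs).
        ratio = torch.exp(logp_new - logp_old)
        unclipped = ratio * advantages
        # Clipping the ratio to [1 - eps, 1 + eps] caps how far a single
        # update can move the policy away from the sampling policy.
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        # Taking the elementwise minimum removes the incentive to make
        # large, reward-exploiting parameter jumps: any apparent gain
        # beyond the clip range is simply not credited to the objective.
        return -torch.min(unclipped, clipped).mean()

Because gains from ratio changes outside the clip range are discarded, each update keeps the new policy close to the one that produced the data, which is what guards against the aggressive, collapse-prone updates described above. (Production RLHF setups typically also add a KL penalty against a frozen reference model, omitted here for brevity.)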

Updated 2025-10-03

Tags: Ch.2 Generative Models - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences, Ch.4 Alignment - Foundations of Large Language Models, Analysis in Bloom's Taxonomy, Cognitive Psychology, Psychology, Social Science, Empirical Science, Science