Multiple Choice

A team is fine-tuning a language model using a reinforcement learning process. In each step, the model generates a response to a prompt, a separate reward model scores the response, and the language model's parameters are updated based on this score. The team finds that a simple update rule, which aggressively maximizes the immediate reward, often leads to 'policy collapse'—the model's linguistic quality degrades, and it starts generating repetitive, nonsensical text that happens to exploit the reward model. What is the primary reason for employing an algorithm like Proximal Policy Optimization (PPO) in this scenario?
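
For a concrete sense of the mechanism the question is probing, the sketch below shows the clipped surrogate objective at the heart of PPO. It is a minimal illustration, not code from any particular RLHF library: the function name and arguments are invented for this example, and it assumes the per-token log-probabilities and reward-model-derived advantage estimates have already been computed.

    import torch

    def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
        # Probability ratio between the updated policy and the policy
        # that actually generated the responses (both given as log-probs).
        ratio = torch.exp(logp_new - logp_old)
        unclipped = ratio * advantages
        # Clipping the ratio to [1 - eps, 1 + eps] caps how far a single
        # update can move the policy away from the sampling policy.
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        # Taking the elementwise minimum removes the incentive to make
        # large, reward-exploiting parameter jumps: any apparent gain
        # beyond the clip range is simply not credited to the objective.
        return -torch.min(unclipped, clipped).mean()

Because gains from ratio changes outside the clip range are discarded, each update keeps the new policy close to the one that produced the data, which is what guards against the aggressive, collapse-prone updates described above. (Production RLHF setups typically also add a KL penalty against a frozen reference model, omitted here for brevity.)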

Updated 2025-10-03

Tags: Ch.2 Generative Models - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences, Ch.4 Alignment - Foundations of Large Language Models, Analysis in Bloom's Taxonomy, Cognitive Psychology, Psychology, Social Science, Empirical Science, Science