1Cademy - In the context of fine-tuning a language model with reinforcement learning, the optimization objective often includes a penalty term that measures the divergence from an initial reference policy. What is the most critical trade-off this penalty term is designed to manage?

Learn Before

PPO Objective for LLM Training

Multiple Choice

In the context of fine-tuning a language model with reinforcement learning, the optimization objective often includes a penalty term that measures the divergence from an initial reference policy. What is the most critical trade-off this penalty term is designed to manage?

Updated 2025-10-03

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences