Learn Before
Diagnosing LLM Training Instability
An AI development team is fine-tuning a large language model using a reinforcement learning approach. They observe that while the model's outputs initially align better with the desired criteria, it soon begins to generate repetitive, low-quality, and nonsensical text. The team's policy optimization objective function contains two key parts: 1) a term that encourages actions with high advantage, clipped to prevent overly large updates, and 2) a penalty term, scaled by a coefficient β, that penalizes divergence of the current policy from a stable, initial reference policy.
Given the observed training behavior, which part of the objective function is likely misconfigured? Specifically, is the β coefficient most likely set too high or too low? Justify your reasoning by explaining the role of the penalty term in the training process.
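For concreteness, a minimal sketch of how such an objective is often assembled is shown below. This is an illustrative PyTorch example, not the team's actual implementation; the function name ppo_kl_objective, the per-token approximation of the divergence penalty, and the default values of clip_eps and beta are assumptions made for this sketch.

```python
# Illustrative sketch: a PPO-style objective with a clipped-ratio term
# and a beta-weighted penalty for drifting from a frozen reference policy.
import torch

def ppo_kl_objective(logp_new, logp_old, logp_ref, advantage,
                     clip_eps=0.2, beta=0.1):
    """Per-token objective to be maximized (all inputs are tensors of the same shape).

    logp_new  : log-probabilities of the sampled tokens under the current policy
    logp_old  : log-probabilities under the policy that generated the rollout
    logp_ref  : log-probabilities under the frozen, initial reference policy
    advantage : advantage estimates for the sampled tokens
    """
    ratio = torch.exp(logp_new - logp_old)              # probability ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # Clipped surrogate: take the more pessimistic of the two terms.
    surrogate = torch.minimum(ratio * advantage, clipped * advantage)
    # Divergence penalty, approximated per token on the sampled sequence.
    kl_penalty = logp_new - logp_ref
    return (surrogate - beta * kl_penalty).mean()
```

In this form, β sets the relative weight of the divergence penalty against the advantage-driven term.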
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Diagnosing LLM Training Instability
A team is fine-tuning a large language model using a reinforcement learning objective that includes a clipped probability ratio multiplied by an advantage estimate, and a penalty term based on the divergence from a reference model. During training, they observe that while the model's average reward is increasing, its outputs are becoming nonsensical and repetitive, losing the general language capabilities of the original model. Which of the following is the most likely cause of this issue?
A language model is being trained using a reinforcement learning objective. For each generated token, part of this objective is calculated as:
Clip(probability_ratio) * Advantage. The probability_ratio is the likelihood of generating the token under the new policy divided by the likelihood under the old policy, and Advantage is an estimate of how much better that token was than the expected average. In a particular training step for a token y, the Advantage is strongly positive, and the probability_ratio is already high (e.g., 1.5, where the clipping threshold is 1.2). How does the Clip function influence the update to the model's policy for generating token y?
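As a small worked sketch of the clipping step described above (the helper name clip and the lower bound 0.8 are assumptions for illustration; the upper bound 1.2 mirrors the threshold in the question):

```python
# Hypothetical helper illustrating the Clip operation on the probability ratio.
def clip(ratio, lower=0.8, upper=1.2):
    """Bound the probability ratio to the interval [lower, upper]."""
    return max(lower, min(ratio, upper))

advantage = 2.0             # strongly positive Advantage for token y
probability_ratio = 1.5     # already above the 1.2 clipping threshold
term = clip(probability_ratio) * advantage
print(term)                 # 1.2 * 2.0 = 2.4 rather than 1.5 * 2.0 = 3.0
```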