Case Study

Diagnosing LLM Training Instability

An AI development team is fine-tuning a large language model with reinforcement learning. The model's outputs initially align better with the desired criteria, but the model soon begins to generate repetitive, low-quality, and nonsensical text. The team's policy optimization objective has two key parts: 1) a term that encourages actions with high advantage, clipped to prevent overly large updates, and 2) a penalty term, weighted by a coefficient β, that measures how far the current policy has diverged from a stable initial reference policy.
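The two-part objective described above resembles a PPO-style clipped surrogate combined with a divergence penalty. The following is a minimal sketch of how such an objective might be computed, not the team's actual implementation. The function name `penalized_clipped_objective`, the clip range `eps`, and the simple sample-based divergence estimate (the mean log-probability difference between the current and reference policies) are all assumptions made for illustration.

```python
import numpy as np

def penalized_clipped_objective(logp_new, logp_old, logp_ref, advantages, beta, eps=0.2):
    """Sketch of a clipped surrogate objective with a reference-policy penalty.

    All inputs are per-sample arrays of log-probabilities or advantages.
    Returns the scalar objective to be maximized (hypothetical formulation).
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Part 1: probability ratio between the current policy and the policy
    # that generated the samples, clipped to limit the size of each update.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    surrogate = np.minimum(unclipped, clipped).mean()

    # Part 2: divergence of the current policy from the reference policy,
    # estimated here (an assumption) as the mean log-probability difference
    # on samples drawn from the current policy.
    divergence = (logp_new - logp_ref).mean()

    # beta weights the penalty against the surrogate term.
    return surrogate - beta * divergence
```

In this sketch, changing `beta` alters only the weight of the divergence term relative to the clipped surrogate term; the surrogate itself is unaffected.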

Given the observed training behavior, which part of the objective function is likely misconfigured? Specifically, is the β coefficient most likely set too high or too low? Justify your reasoning by explaining the role of the penalty term in the training process.


Updated 2025-09-26


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course
