Case Study

Diagnosing Undesirable Behavior in Policy Optimization

An engineering team is optimizing a language model's policy by minimizing a loss function, L(x, {y1, y2}, r), where r is a pre-trained reward model that scores responses according to human preferences. The team observes that while the loss L decreases consistently, the language model increasingly produces overly cautious, generic responses, often refusing to answer harmless questions. Based on the structure of the optimization objective, which component is the most likely source of this undesirable behavior, and why?
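The exact form of L is not given in the prompt, but a hedged sketch can make the failure mode concrete. The snippet below assumes a pairwise ranking-style objective in which the frozen reward model r decides which of the two sampled responses the policy should prefer; the function name pairwise_policy_loss, the sigmoid form, and the toy numbers are illustrative assumptions, not the course's actual definition.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_policy_loss(logp_y1, logp_y2, reward_y1, reward_y2):
    """Hypothetical pairwise objective: the policy is pushed to prefer
    whichever of the two responses the frozen reward model r scores higher.

    logp_y1, logp_y2     : policy log-probabilities of the two responses
    reward_y1, reward_y2 : scores assigned by the pre-trained reward model r
    """
    # r supplies the preference label (winner vs. loser).
    if reward_y1 >= reward_y2:
        logp_w, logp_l = logp_y1, logp_y2
    else:
        logp_w, logp_l = logp_y2, logp_y1
    # The loss falls as the policy separates the winner from the loser.
    return -math.log(sigmoid(logp_w - logp_l))

# Toy illustration (hypothetical numbers): if r systematically over-scores
# cautious refusals, the loss keeps decreasing while the policy drifts
# toward generic, refusal-style answers.
refusal = {"logp": -5.0, "reward": 2.1}   # "I can't help with that"
helpful = {"logp": -7.0, "reward": 1.4}   # substantive answer, scored lower by r
print(pairwise_policy_loss(refusal["logp"], helpful["logp"],
                           refusal["reward"], helpful["reward"]))
```

Running the toy example yields a small, shrinking loss even though the response being reinforced is an unhelpful refusal, which mirrors the pattern the team observes.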


Updated 2025-10-08


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science