Case Study

Diagnosing Undesirable Behavior in Policy Optimization

An engineering team is optimizing a language model's policy by minimizing a loss function, L(x, {y1, y2}, r), where r is a pre-trained reward model that scores responses according to human preferences. The team observes that while the loss L decreases consistently, the language model increasingly produces overly cautious, generic responses, often refusing to answer harmless questions. Based on the structure of the optimization objective, which component is the most likely source of this undesirable behavior, and why?
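The exact form of L is not given in the prompt, but a hedged sketch can make the failure mode concrete. The snippet below assumes a pairwise ranking-style objective in which the frozen reward model r decides which of the two sampled responses the policy should prefer; the function name pairwise_policy_loss, the sigmoid form, and the toy numbers are illustrative assumptions, not the course's actual definition.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_policy_loss(logp_y1, logp_y2, reward_y1, reward_y2):
    """Hypothetical pairwise objective: the policy is pushed to prefer
    whichever of the two responses the frozen reward model r scores higher.

    logp_y1, logp_y2     : policy log-probabilities of the two responses
    reward_y1, reward_y2 : scores assigned by the pre-trained reward model r
    """
    # r supplies the preference label (winner vs. loser).
    if reward_y1 >= reward_y2:
        logp_w, logp_l = logp_y1, logp_y2
    else:
        logp_w, logp_l = logp_y2, logp_y1
    # The loss falls as the policy separates the winner from the loser.
    return -math.log(sigmoid(logp_w - logp_l))

# Toy illustration (hypothetical numbers): if r systematically over-scores
# cautious refusals, the loss keeps decreasing while the policy drifts
# toward generic, refusal-style answers.
refusal = {"logp": -5.0, "reward": 2.1}   # "I can't help with that"
helpful = {"logp": -7.0, "reward": 1.4}   # substantive answer, scored lower by r
print(pairwise_policy_loss(refusal["logp"], helpful["logp"],
                           refusal["reward"], helpful["reward"]))
```

Running the toy example yields a small, shrinking loss even though the response being reinforced is an unhelpful refusal, which mirrors the pattern the team observes.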


Updated 2025-10-08


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science