Diagnosing Undesirable Agent Behavior
An AI agent is trained using reinforcement learning to generate helpful and harmless summaries of news articles. After deployment, it is observed that while the summaries are factually accurate, they are consistently written in an alarming and emotionally inflammatory tone, causing user distress. Based on this observation, what is the most likely deficiency in the agent's training process that led to this undesirable behavior? Explain your reasoning by connecting the components of the learning framework.
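One way to see the likely deficiency is to sketch a reward model whose scoring terms omit tone. Everything below is illustrative: the feature names, weights, and scores are hypothetical, not part of any real training setup. The point is that RL optimizes the policy toward whatever the reward model scores highest, so a missing penalty term shapes behavior.

```python
# Hypothetical reward model: scores factual accuracy and engagement,
# but has no term for emotional tone. All features and weights are
# illustrative (0-1 scores).

def flawed_reward(summary_features):
    return (2.0 * summary_features["factual_accuracy"]
            + 1.0 * summary_features["engagement"])
    # Missing term that would discourage alarmist writing, e.g.:
    #   - 1.5 * summary_features["alarm_level"]

calm = {"factual_accuracy": 0.9, "engagement": 0.5, "alarm_level": 0.1}
alarming = {"factual_accuracy": 0.9, "engagement": 0.9, "alarm_level": 0.9}

# The flawed model scores the alarming summary higher, so the policy
# is optimized toward it even though it distresses users.
print(flawed_reward(calm), flawed_reward(alarming))  # 2.3 2.7
```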
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Impact of Reward Model Flaws on Value Function Estimation
A reinforcement learning agent is trained to find the exit in a maze. Two reward models are proposed. Model A gives a reward of +100 for reaching the exit and 0 for every other step. Model B gives +100 for reaching the exit but also a -1 penalty for each step taken. How will the value function derived from Model B most likely differ from the one derived from Model A for states that are not the exit?
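A minimal sketch of the contrast, assuming a 5-cell corridor where the last cell is the exit and the agent moves one cell left or right per step (the corridor layout and function names are illustrative, not from the card). Value iteration with an undiscounted return shows that Model A makes every non-exit state equally valuable, while Model B's step penalty makes value fall with distance from the exit, creating pressure toward the shortest path.

```python
# Value iteration on a 5-cell corridor; state 4 is the terminal exit.
# Model A: +100 at exit, 0 per step. Model B: +100 at exit, -1 per step.

def value_iteration(step_reward, n=5, exit_reward=100.0, iters=100):
    V = [0.0] * n
    for _ in range(iters):
        for s in range(n - 1):  # exit state (n-1) is terminal; V stays 0
            best = -float("inf")
            for s2 in (max(s - 1, 0), s + 1):  # move left or right
                r = exit_reward if s2 == n - 1 else step_reward
                best = max(best, r + V[s2])
            V[s] = best
    return V

V_a = value_iteration(step_reward=0.0)   # Model A
V_b = value_iteration(step_reward=-1.0)  # Model B
print(V_a)  # [100.0, 100.0, 100.0, 100.0, 0.0] -- no urgency anywhere
print(V_b)  # [97.0, 98.0, 99.0, 100.0, 0.0] -- value falls with distance
```

Under Model A the value function is flat across non-exit states, so the greedy policy has no preference for shorter routes; under Model B the gradient in the value function points toward the exit.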
Diagnosing Undesirable Agent Behavior
In a reinforcement learning framework, it is possible to compute a meaningful long-term value function for a policy even if the reward model consistently provides random, uninformative feedback for every action.
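The statement can be probed with a small Monte Carlo simulation (the setup is illustrative: states, episode counts, and the noise distribution are assumptions). If the reward model emits zero-mean noise that ignores state and action, value estimates for every state converge toward the same number, so the value function cannot rank any state above another.

```python
# Monte Carlo value estimates under a purely random reward model.
# Rewards are uniform(-1, 1) noise, independent of state and action.

import random

random.seed(0)

def mc_value_estimate(n_states=4, episodes=2000, horizon=10):
    totals = [0.0] * n_states
    for _ in range(episodes):
        for s in range(n_states):
            # episode return from state s: a sum of random rewards
            # that never consult the state
            totals[s] += sum(random.uniform(-1, 1) for _ in range(horizon))
    return [t / episodes for t in totals]

V = mc_value_estimate()
# Every estimate hovers near 0: the "value function" is noise and
# carries no information about which states are better.
print(V)
```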