Case Study

Diagnosing Undesirable Agent Behavior

An AI agent is trained using reinforcement learning to generate helpful and harmless summaries of news articles. After deployment, it is observed that while the summaries are factually accurate, they are consistently written in an alarming and emotionally inflammatory tone, causing user distress. Based on this observation, what is the most likely deficiency in the agent's training process that led to this undesirable behavior? Explain your reasoning by connecting the components of the learning framework.

0

1

Updated 2025-10-06

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science