Case Study

Diagnosing and Mitigating Reward Hacking

In the standard training pipeline for language models fine-tuned with human feedback (RLHF), what specific component is designed to prevent the kind of extreme behavioral change described in the case study below, and how does it counteract the model's tendency to over-optimize a flawed reward signal?
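For context on the mechanism the question points at: standard RLHF objectives subtract a KL-divergence penalty, scaled by a coefficient β, from the reward-model score, keeping the policy close to a frozen reference (SFT) model. Below is a minimal sketch of that penalized reward, assuming per-token log-probabilities are already available; the function name, β value, and toy numbers are illustrative, not taken from any particular library.

```python
def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward-model score discounted by a KL penalty against a reference model.

    A high RM score earned by drifting far from the reference policy is
    partially cancelled, which is the standard guard against reward hacking.
    """
    # Per-token log-ratio log pi(a|s) - log pi_ref(a|s); summed over the
    # sequence, this is a sample-based estimate of the KL divergence.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    # Effective reward handed to the RL optimizer (e.g., PPO).
    return rm_score - beta * kl_estimate
```

Usage: when the policy matches the reference, the penalty vanishes and the RM score passes through unchanged; when the policy assigns much higher probability to its own outputs than the reference does, the same RM score yields a lower effective reward.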


Updated 2025-10-03


Tags: Ch.4 Alignment - Foundations of Large Language Models; Foundations of Large Language Models; Foundations of Large Language Models Course; Computing Sciences; Analysis in Bloom's Taxonomy; Cognitive Psychology; Psychology; Social Science; Empirical Science; Science