Essay

Impact of Reward Model Flaws on Value Function Estimation

An agent is being trained to navigate a maze. Its reward model is designed to give a small positive signal for each step taken that does not hit a wall, and a large positive signal for reaching the exit. However, due to a flaw, the model also provides a moderately high positive signal for moving into a specific dead-end corridor. Analyze the likely effect of this flaw on the agent's computed long-term value for states within and near this corridor. How might this flawed value estimation, in turn, influence the agent's final learned path through the maze?
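One way to explore the question is a small numerical sketch. The example below is an illustration under stated assumptions, not a model answer: the 3x4 maze layout, the reward magnitudes (0.1 per valid step, 10 for the exit, a hypothetical bonus of 3 for entering the dead-end cell), the discount factor of 0.9, and the use of tabular value iteration are all choices made here and do not come from the prompt. The names `MAZE`, `step`, `q_value`, and `value_iteration` are likewise hypothetical.

```python
# Assumed layout: S = start, E = exit (terminal), D = dead-end cell
# that the flawed reward model over-rewards, # = wall.
MAZE = ["S..E",
        "#.##",
        "#D##"]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GAMMA = 0.9  # assumed discount factor


def cell(r, c):
    """Return the maze symbol at (r, c); anything off the grid counts as a wall."""
    if 0 <= r < len(MAZE) and 0 <= c < len(MAZE[0]):
        return MAZE[r][c]
    return "#"


def step(state, action, flaw_bonus):
    """Deterministic transition: returns (next_state, reward)."""
    dr, dc = ACTIONS[action]
    nr, nc = state[0] + dr, state[1] + dc
    kind = cell(nr, nc)
    if kind == "#":
        return state, 0.0                   # hit a wall: stay put, no reward
    if kind == "E":
        return (nr, nc), 10.0               # large reward for reaching the exit
    if kind == "D":
        return (nr, nc), 0.1 + flaw_bonus   # step reward plus the flawed bonus
    return (nr, nc), 0.1                    # small reward for a valid step


def q_value(V, state, action, flaw_bonus):
    """One-step lookahead: immediate reward plus discounted next-state value."""
    nxt, reward = step(state, action, flaw_bonus)
    return reward + GAMMA * V[nxt]


def value_iteration(flaw_bonus, sweeps=500):
    """Return (V, greedy_policy) for the maze under the given flaw bonus."""
    states = [(r, c) for r in range(len(MAZE))
              for c in range(len(MAZE[0])) if MAZE[r][c] != "#"]
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            if cell(*s) == "E":
                continue                    # terminal state keeps value 0
            V[s] = max(q_value(V, s, a, flaw_bonus) for a in ACTIONS)
    policy = {s: max(ACTIONS, key=lambda a: q_value(V, s, a, flaw_bonus))
              for s in states if cell(*s) != "E"}
    return V, policy


if __name__ == "__main__":
    junction = (0, 1)  # the cell where the corridor branches off the main path
    for bonus in (0.0, 3.0):
        V, policy = value_iteration(bonus)
        print(f"bonus={bonus}: V(junction)={V[junction]:.2f}, "
              f"greedy action at junction={policy[junction]}")
```

Under these assumed numbers, the greedy action at the junction is "right" (toward the exit) when the bonus is 0 and "down" (into the corridor) when the bonus is 3. The inflated reward propagates backward through the discounted value estimates, so states near the corridor inherit its inflated value, and the agent can end up cycling in and out of the dead end rather than exiting. With a smaller bonus or a larger exit reward the outcome may differ, which is part of what the question asks you to reason about.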


Updated 2025-09-28


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science