Concept

Reward Model as an Imperfect Environment Proxy

In RLHF, a reward model serves as a substitute, or proxy, for the true environment of human preferences: it assigns a scalar score to an LLM's outputs. Because human values are complex and can never be fully captured by a learned model, any reward model is an imperfect representation of what people actually prefer. Optimizing the LLM too aggressively against this flawed proxy can therefore paradoxically degrade the model's actual quality, a failure known as the overoptimization problem.
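To make this dynamic concrete, the short simulation below sketches the characteristic pattern: as the policy drifts further from its reference model, the proxy (reward-model) score keeps rising, while the true preference score peaks and then declines. All functional forms and numbers here are illustrative assumptions chosen only to reproduce the qualitative shape, not measurements from any real system.

import numpy as np

def proxy_reward(d):
    # Score assigned by the imperfect reward model; in this toy setup it
    # grows monotonically the harder we optimize against it.
    return 1.5 * np.sqrt(d)

def true_reward(d):
    # Hypothetical "actual" human preference: it improves at first, then
    # degrades once the policy over-fits the proxy's blind spots.
    return np.sqrt(d) * (1.0 - 0.15 * np.sqrt(d))

# d stands for how far optimization has pushed the policy away from the
# initial model, e.g. the KL divergence from the reference policy.
for d in [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]:
    print(f"KL={d:5.1f}  proxy={proxy_reward(d):5.2f}  true={true_reward(d):5.2f}")

Running the sketch shows the proxy score increasing at every step while the true score tops out (here around a KL of 9 to 16) and then falls, which is the overoptimization gap the section describes.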
