Learn Before
Reward Model as an Imperfect Environment Proxy
In the context of reinforcement learning from human feedback (RLHF), a reward model serves as a substitute, or proxy, for the true environment of human preferences: it assigns a quantitative score to an LLM's output. However, because human values are immensely complex and can never be fully captured, any reward model is inherently an imperfect representation of them. Consequently, optimizing an LLM too aggressively against this flawed proxy can paradoxically degrade the actual quality of its outputs, a phenomenon referred to as the overoptimization problem.
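To make this dynamic concrete, below is a minimal, hypothetical Python sketch (not from the source): a deliberately misspecified proxy reward is fit to a handful of noisy samples of a true preference function, and a single output feature is then optimized purely against that proxy. The proxy score keeps rising while the true score collapses, mirroring the overoptimization problem. All names here (`true_reward`, `proxy_reward`, the 1-D feature `x`) are illustrative assumptions, not part of any real RLHF pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a single scalar feature x of an LLM response
# (e.g. its verbosity) stands in for the full output.

def true_reward(x):
    # True human preference: quality peaks at a moderate x and falls off beyond it.
    return -(x - 2.0) ** 2

# Fit an imperfect reward model on a few noisy preference scores with x in [0, 3].
train_x = rng.uniform(0.0, 3.0, size=20)
train_y = true_reward(train_x) + rng.normal(scale=0.3, size=20)
coeffs = np.polyfit(train_x, train_y, deg=1)  # deliberately too simple (linear) proxy

def proxy_reward(x):
    return np.polyval(coeffs, x)

# "Optimize" the output purely against the proxy by hill-climbing on x.
x = 1.0
for step in range(31):
    grad_sign = np.sign(proxy_reward(x + 1e-3) - proxy_reward(x - 1e-3))
    x += 0.5 * grad_sign
    if step % 5 == 0:
        print(f"step {step:2d}  x={x:5.2f}  "
              f"proxy={proxy_reward(x):7.2f}  true={true_reward(x):8.2f}")
```

Under this toy setup, the proxy extrapolates "more is better" beyond the region covered by the preference data, so the optimizer drifts toward outputs the proxy scores highly but the true preference function does not.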
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.2 Generative Models - Foundations of Large Language Models
Related
Reward Model as an Imperfect Environment Proxy
Direct Preference Optimization (DPO) Training Process
Comparison of RLHF and DPO Training Pipelines
Limitations of Human Feedback for LLM Alignment
An AI development team aims to align a large language model to be more helpful. They create a dataset where, for a given prompt, they collect two different responses from the model and have human annotators label which of the two responses is superior. What is the primary and most direct function of this specific type of dataset in a human preference alignment methodology?
A development team is refining a large language model to be more helpful and harmless. They are using a method that involves learning from human judgments about which of two responses is better. Arrange the following three core stages of this alignment process into the correct chronological order.
Insufficiency of Data Fitting for Complex Value Alignment
Comparison of AI Feedback and Human Feedback for LLM Alignment
Outcome-Based Reward Models
AI Chatbot Alignment Strategy
Learn After
Overoptimization Problem in Reward Modeling (Reward Hacking or Reward Gaming)
A team is training a large language model using a scoring function derived from human preference data. They observe that after a certain point, continuing to train the model to maximize its score leads to a decrease in the actual quality of its responses as judged by human evaluators. What is the most fundamental reason for this phenomenon?
Divergence in LLM Performance
The Paradox of Optimization in Reward Modeling