Learn Before
Reward Model as an Imperfect Environment Proxy
In the context of reinforcement learning from human feedback (RLHF), a reward model serves as a substitute, or proxy, for the true environment of human preferences: it assigns a quantitative score to an LLM's output. However, because human values are immensely complex and can never be fully captured, any reward model is inherently an imperfect representation of them. Consequently, optimizing an LLM too aggressively against this flawed proxy can paradoxically degrade the actual quality of its outputs, a phenomenon referred to as the overoptimization problem.
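To make this dynamic concrete, below is a minimal, hypothetical Python sketch (not from the source): a deliberately misspecified proxy reward is fit to a handful of noisy samples of a true preference function, and a single output feature is then optimized purely against that proxy. The proxy score keeps rising while the true score collapses, mirroring the overoptimization problem. All names here (`true_reward`, `proxy_reward`, the 1-D feature `x`) are illustrative assumptions, not part of any real RLHF pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a single scalar feature x of an LLM response
# (e.g. its verbosity) stands in for the full output.

def true_reward(x):
    # True human preference: quality peaks at a moderate x and falls off beyond it.
    return -(x - 2.0) ** 2

# Fit an imperfect reward model on a few noisy preference scores with x in [0, 3].
train_x = rng.uniform(0.0, 3.0, size=20)
train_y = true_reward(train_x) + rng.normal(scale=0.3, size=20)
coeffs = np.polyfit(train_x, train_y, deg=1)  # deliberately too simple (linear) proxy

def proxy_reward(x):
    return np.polyval(coeffs, x)

# "Optimize" the output purely against the proxy by hill-climbing on x.
x = 1.0
for step in range(31):
    grad_sign = np.sign(proxy_reward(x + 1e-3) - proxy_reward(x - 1e-3))
    x += 0.5 * grad_sign
    if step % 5 == 0:
        print(f"step {step:2d}  x={x:5.2f}  "
              f"proxy={proxy_reward(x):7.2f}  true={true_reward(x):8.2f}")
```

Under this toy setup, the proxy extrapolates "more is better" beyond the region covered by the preference data, so the optimizer drifts toward outputs the proxy scores highly but the true preference function does not.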
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.2 Generative Models - Foundations of Large Language Models
Related
Reward Model as an Imperfect Environment Proxy
Direct Preference Optimization (DPO) Training Process
Comparison of RLHF and DPO Training Pipelines
Limitations of Human Feedback for LLM Alignment
An AI development team aims to align a large language model to be more helpful. They create a dataset where, for a given prompt, they collect two different responses from the model and have human annotators label which of the two responses is superior. What is the primary and most direct function of this specific type of dataset in a human preference alignment methodology?
A development team is refining a large language model to be more helpful and harmless. They are using a method that involves learning from human judgments about which of two responses is better. Arrange the following three core stages of this alignment process into the correct chronological order.
Insufficiency of Data Fitting for Complex Value Alignment
Comparison of AI Feedback and Human Feedback for LLM Alignment
Outcome-Based Reward Models
AI Chatbot Alignment Strategy
Learn After
Overoptimization Problem in Reward Modeling (Reward Hacking or Reward Gaming)
A team is training a large language model using a scoring function derived from human preference data. They observe that after a certain point, continuing to train the model to maximize its score leads to a decrease in the actual quality of its responses as judged by human evaluators. What is the most fundamental reason for this phenomenon?
Divergence in LLM Performance
The Paradox of Optimization in Reward Modeling