Distinguishing Model Outputs in Preference Alignment
In a system that uses reinforcement learning to align a language model with human preferences, two key components are a 'reward model' and a 'value model'. Both often share a similar underlying architecture, taking a sequence of text as input and producing a single scalar as output. Explain the fundamental difference between what the scalar output of the reward model represents and what the scalar output of the value model represents.
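A minimal sketch may help make the contrast concrete. The code below is illustrative, not from any particular library: the names (ScalarHead, ToyBackbone, reward_of, value_of) are hypothetical, and the toy backbone stands in for a pretrained LM trunk. The point is that the two models can share the same "text in, scalar out" architecture, while the reward model's scalar is typically read once on a complete prompt-response pair and the value model's scalar is queried at intermediate states during generation.

```python
import torch
import torch.nn as nn


class ToyBackbone(nn.Module):
    """Stand-in for a pretrained LM trunk; returns per-token hidden states."""

    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(input_ids)  # (batch, seq_len, hidden_size)


class ScalarHead(nn.Module):
    """Shared architecture: a backbone plus a linear layer mapping each
    hidden state to one scalar. Both the reward model and the value model
    can be built this way; only the training target and point of use differ."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone
        self.scalar = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)       # (batch, seq_len, hidden_size)
        return self.scalar(hidden).squeeze(-1)  # (batch, seq_len)


def reward_of(rm: ScalarHead, full_sequence: torch.Tensor) -> torch.Tensor:
    """Reward model: score a *complete* prompt+response. The scalar at the
    final token is read as 'how much would a human prefer this output?'"""
    return rm(full_sequence)[:, -1]  # one preference score per finished sequence


def value_of(vm: ScalarHead, state_so_far: torch.Tensor) -> torch.Tensor:
    """Value model (critic): queried at an *intermediate* state. Its scalar
    predicts the total expected future reward from this state onward,
    e.g. for computing advantages during policy optimization."""
    return vm(state_so_far)[:, -1]  # expected return from the current state


# Usage: same architecture, different questions asked of the scalar.
rm = ScalarHead(ToyBackbone(vocab_size=100, hidden_size=16), hidden_size=16)
vm = ScalarHead(ToyBackbone(vocab_size=100, hidden_size=16), hidden_size=16)
full = torch.randint(0, 100, (1, 12))  # a finished prompt+response
partial = full[:, :5]                  # a mid-generation state
print(reward_of(rm, full))     # judgment about a finished output
print(value_of(vm, partial))   # forecast of cumulative future reward
```

In this framing, the reward model's scalar is an immediate judgment about a completed output, while the value model's scalar is a forecast of cumulative future reward from a partial state, which is why only the latter is useful as a critic while generation is still in progress.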
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
In a system designed to align a language model with human preferences, one component functions as a 'critic'. It takes the current state (e.g., a conversation history) as input and outputs a single scalar value predicting the total expected future rewards from that state. This component's architecture is often a large language model with a final linear layer for the scalar output. Which statement best distinguishes this specific component from others in the system?
Diagnosing a Reinforcement Learning System