Short Answer

Distinguishing Model Outputs in Preference Alignment

In a system that uses reinforcement learning to align a language model with human preferences, two key components are a 'reward model' and a 'value model'. Both often share a similar underlying architecture, taking a sequence of text as input and producing a single scalar number as output. Explain the fundamental difference between what the scalar output of the reward model represents versus what the scalar output of the value model represents.
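The setup described above, in which both models end in a scalar head on top of the same kind of backbone, can be made concrete in code. Below is a minimal PyTorch sketch, not course code: the class name ScalarHeadModel, the toy hyperparameters, and the dummy inputs are all illustrative assumptions. It shows that the two models can be architecturally identical, so the difference lies entirely in what the scalar is trained to estimate: the reward model's output scores a complete response against human preferences, while the value model's output estimates the expected future (cumulative) reward obtainable from a partial sequence, i.e. a state mid-generation.

```python
import torch
import torch.nn as nn

class ScalarHeadModel(nn.Module):
    """Toy transformer mapping a token sequence to a single scalar.

    The same architecture can serve as a reward model or a value model;
    only the training target (and hence the meaning of the scalar) differs.
    """
    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.scalar_head = nn.Linear(d_model, 1)  # scalar output head

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))           # (B, T, d_model)
        return self.scalar_head(h[:, -1, :]).squeeze(-1)  # (B,): one scalar per sequence

# Identical architecture, different semantics (set by the training objective):
reward_model = ScalarHeadModel()  # r(prompt + full response): preference score for a finished output
value_model = ScalarHeadModel()   # V(state): expected future reward from a partial sequence

prompt_plus_response = torch.randint(0, 1000, (1, 32))  # a complete response
partial_sequence = torch.randint(0, 1000, (1, 10))      # a generation still in progress

r = reward_model(prompt_plus_response)  # judges the outcome; meaningful only once the response is complete
v = value_model(partial_sequence)       # predicts reward still to come; defined at every intermediate state
```

Because the value model's scalar is defined at every intermediate state, it can supply per-step baselines for policy-gradient updates (as in PPO's advantage estimates), whereas the reward model's scalar is typically only meaningful once a response is complete.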

Updated 2025-10-03

Tags

Ch.4 Alignment - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences, Analysis in Bloom's Taxonomy, Cognitive Psychology, Psychology, Social Science, Empirical Science, Science