Concept

Conceptual Reward Model in DPO's Training Objective

The Direct Preference Optimization (DPO) method is founded on a training objective that conceptually relies on a reward model, r(x, y), which scores the quality of an output y for an input x. Although DPO's final formulation does not require an explicitly trained reward model, the assumption that such a reward function exists is the starting point of its derivation.
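To make the role of the conceptual reward concrete, here is a minimal Python sketch. It assumes the standard DPO reparameterization, in which the reward is expressed through the policy itself as r(x, y) = β · log(π(y | x) / π_ref(y | x)) plus a term that depends only on x, and it assumes preferences follow a Bradley-Terry model, so that term cancels when two outputs for the same input are compared. The function names, the default β = 0.1, and the example log-probabilities are illustrative assumptions, not values from the source.

```python
import math


def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """Implicit reward beta * log(pi(y|x) / pi_ref(y|x)).

    Inputs are sequence log-probabilities of the same output y under the
    policy and under the frozen reference model. The x-only term of the
    reward is omitted because it cancels in pairwise comparisons.
    """
    return beta * (logp_policy - logp_ref)


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(r(x, y_w) - r(x, y_l)).

    y_w is the preferred output and y_l the dispreferred one (assumed naming).
    """
    margin = (implicit_reward(logp_w, ref_logp_w, beta)
              - implicit_reward(logp_l, ref_logp_l, beta))
    # -log sigmoid(m) = log(1 + exp(-m)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities for one preference pair.
loss_at_reference = dpo_loss(-1.0, -2.0, -1.0, -2.0)  # policy equals reference
loss_after_shift = dpo_loss(-0.5, -3.0, -1.0, -2.0)   # policy favors y_w more
```

When the policy equals the reference model, the implicit rewards are both zero and the loss is log 2. Shifting probability toward the preferred output raises the reward margin and lowers the loss, without a separate reward model ever being trained.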

Updated 2025-10-07

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
