Short Answer

Mechanism of Reward Model Elimination

In a particular language model alignment method, the implicit reward for a response y given a prompt x is defined as r(x, y) = β log(π_θ(y|x) / π_ref(y|x)) + β log Z(x), where Z(x) is a partition function that is intractable to compute. Training relies on the difference in rewards between a preferred response y_a and a dispreferred response y_b. Explain what happens to the β log Z(x) term during this difference calculation and why that outcome is the key reason the method can be trained without an explicit reward model.
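
For reference, the cancellation can be worked out directly from the definition above. Since Z(x) depends only on the prompt, it contributes the same β log Z(x) to both rewards:

$$
\begin{aligned}
r(x, y_a) - r(x, y_b)
&= \Big[\beta \log \frac{\pi_\theta(y_a \mid x)}{\pi_{\mathrm{ref}}(y_a \mid x)} + \beta \log Z(x)\Big]
 - \Big[\beta \log \frac{\pi_\theta(y_b \mid x)}{\pi_{\mathrm{ref}}(y_b \mid x)} + \beta \log Z(x)\Big] \\
&= \beta \log \frac{\pi_\theta(y_a \mid x)}{\pi_{\mathrm{ref}}(y_a \mid x)}
 - \beta \log \frac{\pi_\theta(y_b \mid x)}{\pi_{\mathrm{ref}}(y_b \mid x)}.
\end{aligned}
$$

Because the intractable term cancels, the reward difference, and hence any preference loss built on it, can be computed from policy and reference log-probabilities alone; no separately trained reward model is needed. A minimal sketch in PyTorch (the function name, the β default, and the Bradley-Terry-style loss form are illustrative assumptions, not given in the question):

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(policy_logp_a: torch.Tensor,
                             policy_logp_b: torch.Tensor,
                             ref_logp_a: torch.Tensor,
                             ref_logp_b: torch.Tensor,
                             beta: float = 0.1) -> torch.Tensor:
    """Preference loss computed without any explicit reward model.

    Each argument is log pi(y|x) summed over the response tokens, for
    the preferred (a) and dispreferred (b) responses. The beta*log Z(x)
    term never appears: it is identical for both responses and cancels
    in the reward difference.
    """
    # Implicit rewards, each missing the shared beta*log Z(x) term
    reward_a = beta * (policy_logp_a - ref_logp_a)
    reward_b = beta * (policy_logp_b - ref_logp_b)
    # Bradley-Terry-style objective on the (Z-free) reward difference
    return -F.logsigmoid(reward_a - reward_b).mean()
```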

Updated 2025-10-02

Tags

Ch.4 Alignment - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences, Analysis in Bloom's Taxonomy, Cognitive Psychology, Psychology, Social Science, Empirical Science, Science
