Formula

Reward Function in Terms of Policy Models and Normalization Factor

By rearranging the equation for the optimal target policy, the underlying reward function $r(\mathbf{x}, \mathbf{y})$ can be expressed solely in terms of the target model $\pi_{\theta}(\mathbf{y}|\mathbf{x})$, the reference model $\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x})$, and the normalization factor $Z(\mathbf{x})$. This is a notable shift in perspective: although the initial goal was to learn a policy from a given reward model, the rearrangement yields a representation of the reward model derived entirely from the policy. The resulting formula is:

$$r(\mathbf{x},\mathbf{y}) = \beta \left(\log \frac{\pi_{\theta}(\mathbf{y}|\mathbf{x})}{\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x})} + \log Z(\mathbf{x}) \right)$$
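The rearrangement itself is short. As a sketch, assuming the standard closed form of the KL-regularized optimum (the "equation for the optimal target policy" referenced above), taking logarithms and solving for the reward gives:

```latex
% Closed form of the optimal policy for the KL-regularized objective:
\pi_{\theta}(\mathbf{y}|\mathbf{x})
  = \frac{1}{Z(\mathbf{x})}\,
    \pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x})\,
    \exp\!\left(\tfrac{1}{\beta}\, r(\mathbf{x},\mathbf{y})\right)

% Take logarithms of both sides:
\log \pi_{\theta}(\mathbf{y}|\mathbf{x})
  = \log \pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x})
    + \tfrac{1}{\beta}\, r(\mathbf{x},\mathbf{y})
    - \log Z(\mathbf{x})

% Solve for the reward:
r(\mathbf{x},\mathbf{y})
  = \beta \left( \log \frac{\pi_{\theta}(\mathbf{y}|\mathbf{x})}
                           {\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x})}
                 + \log Z(\mathbf{x}) \right)
```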

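A practical consequence of this formula is that the intractable term $\log Z(\mathbf{x})$ depends only on the prompt, so it cancels whenever two responses to the same prompt are compared. The sketch below illustrates this with hypothetical log-probabilities and a hypothetical $\beta = 0.1$ (none of these numbers come from the text):

```python
def implicit_reward(logp_theta: float, logp_ref: float, beta: float = 0.1) -> float:
    """Implicit reward up to the additive constant beta * log Z(x):
    beta * log( pi_theta(y|x) / pi_ref(y|x) )."""
    return beta * (logp_theta - logp_ref)

# Two candidate responses y_a, y_b to the same prompt x
# (log-probabilities are hypothetical, for illustration only).
r_a = implicit_reward(logp_theta=-12.0, logp_ref=-15.0)  # policy upweights y_a
r_b = implicit_reward(logp_theta=-14.0, logp_ref=-13.5)  # policy downweights y_b

# Both rewards omit the same beta * log Z(x) term, so the reward
# *margin* between responses is exact and Z(x) never has to be computed.
print(r_a - r_b)
```

This cancellation is precisely what makes the reward representation usable in practice despite $Z(\mathbf{x})$ being intractable.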
Updated 2026-05-03
