Formula

Reward Function in Terms of Policy Models and Normalization Factor

By rearranging the equation for the optimal target policy, the underlying reward function $r(\mathbf{x}, \mathbf{y})$ can be expressed solely in terms of the target model $\pi_{\theta}(\mathbf{y}|\mathbf{x})$, the reference model $\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x})$, and the normalization factor $Z(\mathbf{x})$. This is a notable shift in perspective: although the initial goal was to learn a policy from a given reward model, the rearrangement yields a representation of the reward model derived entirely from the policy. The resulting formula is:

$$r(\mathbf{x},\mathbf{y}) = \beta \left(\log \frac{\pi_{\theta}(\mathbf{y}|\mathbf{x})}{\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x})} + \log Z(\mathbf{x}) \right)$$
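The rearrangement itself is short. As a sketch, assuming the standard closed form of the KL-regularized optimum (the "equation for the optimal target policy" referenced above), taking logarithms and solving for the reward gives:

```latex
% Closed form of the optimal policy for the KL-regularized objective:
\pi_{\theta}(\mathbf{y}|\mathbf{x})
  = \frac{1}{Z(\mathbf{x})}\,
    \pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x})\,
    \exp\!\left(\tfrac{1}{\beta}\, r(\mathbf{x},\mathbf{y})\right)

% Take logarithms of both sides:
\log \pi_{\theta}(\mathbf{y}|\mathbf{x})
  = \log \pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x})
    + \tfrac{1}{\beta}\, r(\mathbf{x},\mathbf{y})
    - \log Z(\mathbf{x})

% Solve for the reward:
r(\mathbf{x},\mathbf{y})
  = \beta \left( \log \frac{\pi_{\theta}(\mathbf{y}|\mathbf{x})}
                           {\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x})}
                 + \log Z(\mathbf{x}) \right)
```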

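A practical consequence of this formula is that the intractable term $\log Z(\mathbf{x})$ depends only on the prompt, so it cancels whenever two responses to the same prompt are compared. The sketch below illustrates this with hypothetical log-probabilities and a hypothetical $\beta = 0.1$ (none of these numbers come from the text):

```python
def implicit_reward(logp_theta: float, logp_ref: float, beta: float = 0.1) -> float:
    """Implicit reward up to the additive constant beta * log Z(x):
    beta * log( pi_theta(y|x) / pi_ref(y|x) )."""
    return beta * (logp_theta - logp_ref)

# Two candidate responses y_a, y_b to the same prompt x
# (log-probabilities are hypothetical, for illustration only).
r_a = implicit_reward(logp_theta=-12.0, logp_ref=-15.0)  # policy upweights y_a
r_b = implicit_reward(logp_theta=-14.0, logp_ref=-13.5)  # policy downweights y_b

# Both rewards omit the same beta * log Z(x) term, so the reward
# *margin* between responses is exact and Z(x) never has to be computed.
print(r_a - r_b)
```

This cancellation is precisely what makes the reward representation usable in practice despite $Z(\mathbf{x})$ being intractable.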
Updated 2026-05-03
