Learn Before
Optimal Policy as a Product of Reference Policy and Exponentiated Reward
In reinforcement learning, particularly for language model alignment, an optimal policy $\pi^*$ is defined to be proportional to the product of a reference policy $\pi_{\text{ref}}$ and the exponential of a scaled reward function $r(x, y)$. The unnormalized probability for an output $y$ given an input $x$ is given by the expression:

$$\pi^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x)\,\exp\!\left(\frac{1}{\beta}\, r(x, y)\right)$$

This formulation effectively re-weights the probabilities from the reference policy, increasing the likelihood of outputs that receive a higher reward. The parameter $\beta$ acts as a temperature, controlling the influence of the reward function on the final policy. The full policy is obtained by normalizing this expression over all possible outputs, i.e., dividing by the partition function $Z(x) = \sum_{y'} \pi_{\text{ref}}(y' \mid x)\,\exp\!\left(\frac{1}{\beta}\, r(x, y')\right)$.
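As a concrete illustration, here is a minimal sketch of the re-weighting and normalization in Python. The candidate outputs, their reference probabilities, and the reward values are hypothetical placeholders, not from any particular model or library:

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """Re-weight reference-policy probabilities by exp(r / beta) and normalize.

    ref_probs: dict mapping candidate output -> pi_ref(y | x)
    rewards:   dict mapping candidate output -> r(x, y)
    beta:      temperature; a smaller beta gives the reward more influence
    """
    # Unnormalized weights: pi_ref(y | x) * exp(r(x, y) / beta)
    weights = {y: p * math.exp(rewards[y] / beta) for y, p in ref_probs.items()}
    # Partition function Z(x) turns the weights into a proper distribution
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}

# Illustrative example: two candidates with equal reference probability
ref_probs = {"A": 0.5, "B": 0.5}
rewards = {"A": 3.0, "B": 1.0}
print(optimal_policy(ref_probs, rewards, beta=1.0))
# The higher-reward output "A" receives most of the probability mass
```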

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Plackett-Luce Selection Probability Formula
Worth Function in Plackett-Luce Model
A language model's policy, which determines the probability of generating an output $y$ given an input $x$, is structured to be proportional to the exponential of a reward score $r(x, y)$. For a specific input, two potential outputs have the following reward scores:
- Output A: Reward = 3.0
- Output B: Reward = 1.0
Based on this formulation, how does the probability of generating Output A compare to the probability of generating Output B?
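One way to work out the comparison: under this proportionality, the normalizer cancels in the ratio, so only the difference of the reward scores matters:

$$\frac{P(\text{Output A})}{P(\text{Output B})} = \frac{e^{3.0}}{e^{1.0}} = e^{2.0} \approx 7.39$$

Output A is therefore about 7.4 times more likely than Output B, not 3 times (the ratio of the raw scores).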
Analyzing Language Model Response Probabilities
A language model's policy is designed such that the probability of generating an output is proportional to the exponential of its reward score. If Output Y has a reward score exactly double that of Output Z, the policy does not simply assign double the probability to Output Y: because probabilities scale with $e^{r}$, the ratio of the two probabilities is $e^{r_Y - r_Z}$, which depends on the difference of the scores rather than their ratio.
Learn After
A team is refining a text-generation model. The final probability of a generated text sequence is proportional to the product of its probability from an initial base model and an exponentiated reward score. The reward's influence is controlled by a scaling parameter, β, in the exponent, where a smaller β gives the reward more weight. The team observes that when they significantly decrease the value of β, the model's outputs become more repetitive and sometimes nonsensical, even though they achieve very high scores from the reward model. Which of the following best explains this behavior?
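To see the effect described in this scenario, one can reuse the same re-weighting with progressively smaller values of β. The probabilities and reward values below are illustrative placeholders:

```python
import math

ref_probs = {"A": 0.50, "B": 0.45, "C": 0.05}   # base-model probabilities (illustrative)
rewards = {"A": 1.0, "B": 0.9, "C": 2.5}        # reward model strongly favors "C"

for beta in (10.0, 1.0, 0.1):
    weights = {y: p * math.exp(rewards[y] / beta) for y, p in ref_probs.items()}
    z = sum(weights.values())
    policy = {y: round(w / z, 3) for y, w in weights.items()}
    print(f"beta={beta}: {policy}")
# As beta shrinks, the policy collapses onto the highest-reward output "C",
# even though the base model considers it unlikely: the reward term dominates
# and the reference policy's regularizing influence vanishes.
```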
Diagnosing Language Model Alignment Issues
Justifying the Reference Policy