Formula

Optimal Policy as a Product of Reference Policy and Exponentiated Reward

In reinforcement learning, particularly for language model alignment, an optimal policy $\pi^*(\mathbf{y}|\mathbf{x})$ is defined to be proportional to the product of a reference policy $\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x})$ and the exponential of a scaled reward function $r(\mathbf{x}, \mathbf{y})$. The unnormalized probability for an output $\mathbf{y}$ given an input $\mathbf{x}$ is given by the expression:

$$\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}) \exp\left(\frac{1}{\beta} r(\mathbf{x}, \mathbf{y})\right)$$

This formulation re-weights the probabilities from the reference policy, increasing the likelihood of outputs that receive a higher reward. The parameter $\beta$ acts as a temperature, controlling the influence of the reward function on the final policy. The full policy $\pi^*$ is obtained by normalizing this expression over all possible outputs $\mathbf{y}$.
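Written out with the normalization made explicit, the full policy is

$$\pi^*(\mathbf{y}|\mathbf{x}) = \frac{1}{Z(\mathbf{x})}\, \pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}) \exp\left(\frac{1}{\beta} r(\mathbf{x}, \mathbf{y})\right), \qquad Z(\mathbf{x}) = \sum_{\mathbf{y}} \pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}) \exp\left(\frac{1}{\beta} r(\mathbf{x}, \mathbf{y})\right),$$

where $Z(\mathbf{x})$ is the partition function that makes $\pi^*(\cdot|\mathbf{x})$ sum to one.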
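To make the re-weighting concrete, here is a minimal sketch over a toy discrete output space; the reference probabilities and rewards are made-up illustrative values, not from the source:

```python
import numpy as np

def optimal_policy(ref_probs, rewards, beta):
    """pi*(y|x) is proportional to pi_ref(y|x) * exp(r(x, y) / beta),
    normalized over all outputs y."""
    unnormalized = ref_probs * np.exp(rewards / beta)
    return unnormalized / unnormalized.sum()

ref = np.array([0.5, 0.3, 0.2])   # pi_ref(y|x) for three candidate outputs
r   = np.array([1.0, 2.0, 0.5])   # r(x, y) for each candidate

for beta in (10.0, 1.0, 0.1):
    print(beta, optimal_policy(ref, r, beta))
# Large beta: the result stays close to the reference distribution.
# Small beta: probability mass concentrates on the highest-reward output.
```

Running the loop shows the temperature behavior described above: as $\beta$ grows the reward term flattens toward 1 and the reference policy dominates, while as $\beta$ shrinks the policy approaches a one-hot distribution on the highest-reward output.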

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences