Formula

Target Policy as a Reward-Weighted Distribution

In policy optimization frameworks such as RLHF, the target policy π_θ being learned is defined to be equal to an optimal distribution π*. This optimal distribution is obtained by re-weighting a reference policy π_{θ_ref} according to a reward function r(x, y):

\pi_{\theta}(\mathbf{y}|\mathbf{x}) = \pi^{*}(\mathbf{y}|\mathbf{x}) = \frac{\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}) \exp\left(\frac{1}{\beta} r(\mathbf{x}, \mathbf{y})\right)}{Z(\mathbf{x})}

Here β is a temperature-like coefficient that controls how strongly the reward reshapes the reference distribution, and Z(\mathbf{x}) = \sum_{\mathbf{y}} \pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}) \exp\left(\frac{1}{\beta} r(\mathbf{x}, \mathbf{y})\right) is the partition function that normalizes the result. This equation defines the ideal policy that the model, parameterized by θ, aims to learn, balancing adherence to the reference model against maximization of the reward.
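To make the re-weighting concrete, here is a minimal NumPy sketch that computes π*(y|x) over a small discrete set of candidate responses. The function name optimal_policy and the toy probabilities and rewards are illustrative assumptions, not taken from the source.

```python
import numpy as np

def optimal_policy(ref_probs: np.ndarray, rewards: np.ndarray, beta: float) -> np.ndarray:
    """Re-weight a reference policy by exp(r / beta) and renormalize.

    ref_probs: pi_ref(y|x) over a discrete set of candidate responses y.
    rewards:   r(x, y) for each candidate response.
    beta:      regularization strength; larger beta stays closer to pi_ref.
    """
    # Unnormalized weights pi_ref(y|x) * exp(r(x, y) / beta), computed in
    # log space; subtracting the max keeps exp() numerically stable and
    # cancels out in the normalization.
    logits = np.log(ref_probs) + rewards / beta
    logits -= logits.max()
    weights = np.exp(logits)
    # Dividing by the sum of the weights plays the role of Z(x).
    return weights / weights.sum()

# Example: three candidate responses for a single prompt x.
ref_probs = np.array([0.5, 0.3, 0.2])  # pi_ref(y|x)
rewards = np.array([1.0, 2.0, 0.5])    # r(x, y)
print(optimal_policy(ref_probs, rewards, beta=1.0))
print(optimal_policy(ref_probs, rewards, beta=0.1))
```

Running it with a smaller β sharpens the distribution toward the highest-reward response, while a larger β keeps π* close to the reference policy, which mirrors the trade-off described above.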

