Formula

Target Policy as a Reward-Weighted Distribution

In policy optimization frameworks such as RLHF, the target policy π_θ being learned is defined to be equal to an optimal distribution π*. This optimal distribution is obtained by re-weighting a reference policy π_{θ_ref} according to a reward function r(x, y):

\pi_{\theta}(\mathbf{y}|\mathbf{x}) = \pi^{*}(\mathbf{y}|\mathbf{x}) = \frac{\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}) \exp\left(\frac{1}{\beta} r(\mathbf{x}, \mathbf{y})\right)}{Z(\mathbf{x})}

Here β is a temperature-like coefficient that controls how strongly the reward reshapes the reference distribution, and Z(\mathbf{x}) = \sum_{\mathbf{y}} \pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}) \exp\left(\frac{1}{\beta} r(\mathbf{x}, \mathbf{y})\right) is the partition function that normalizes the result. This equation defines the ideal policy that the model, parameterized by θ, aims to learn, balancing adherence to the reference model against maximization of the reward.
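To make the re-weighting concrete, here is a minimal NumPy sketch that computes π*(y|x) over a small discrete set of candidate responses. The function name optimal_policy and the toy probabilities and rewards are illustrative assumptions, not taken from the source.

```python
import numpy as np

def optimal_policy(ref_probs: np.ndarray, rewards: np.ndarray, beta: float) -> np.ndarray:
    """Re-weight a reference policy by exp(r / beta) and renormalize.

    ref_probs: pi_ref(y|x) over a discrete set of candidate responses y.
    rewards:   r(x, y) for each candidate response.
    beta:      regularization strength; larger beta stays closer to pi_ref.
    """
    # Unnormalized weights pi_ref(y|x) * exp(r(x, y) / beta), computed in
    # log space; subtracting the max keeps exp() numerically stable and
    # cancels out in the normalization.
    logits = np.log(ref_probs) + rewards / beta
    logits -= logits.max()
    weights = np.exp(logits)
    # Dividing by the sum of the weights plays the role of Z(x).
    return weights / weights.sum()

# Example: three candidate responses for a single prompt x.
ref_probs = np.array([0.5, 0.3, 0.2])  # pi_ref(y|x)
rewards = np.array([1.0, 2.0, 0.5])    # r(x, y)
print(optimal_policy(ref_probs, rewards, beta=1.0))
print(optimal_policy(ref_probs, rewards, beta=0.1))
```

Running it with a smaller β sharpens the distribution toward the highest-reward response, while a larger β keeps π* close to the reference policy, which mirrors the trade-off described above.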

