Reward-Weighted Probability Distribution
A reward-weighted probability distribution, denoted \(\pi^*\), is a new distribution created by modifying a reference distribution, \(\pi_{\text{ref}}\), based on a reward signal, \(r(\mathbf{x}, \mathbf{y})\). The reference distribution's probability for each output is scaled by an exponential factor of the reward. The entire expression is then normalized by the partition function \(Z(\mathbf{x})\) to ensure it sums to one and is a valid probability distribution. The formula is:

\[
\pi^*(\mathbf{y}|\mathbf{x}) = \frac{1}{Z(\mathbf{x})}\, \pi_{\text{ref}}(\mathbf{y}|\mathbf{x})\, \exp\!\left(\frac{r(\mathbf{x}, \mathbf{y})}{\beta}\right), \qquad Z(\mathbf{x}) = \sum_{\mathbf{y}} \pi_{\text{ref}}(\mathbf{y}|\mathbf{x})\, \exp\!\left(\frac{r(\mathbf{x}, \mathbf{y})}{\beta}\right)
\]
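A minimal numerical sketch of this construction (the two-outcome distribution and reward values reuse the worked example below; the function name `reward_weighted` is illustrative):

```python
import math

def reward_weighted(pi_ref, reward, beta=1.0):
    """Scale each reference probability by exp(r / beta), then renormalize."""
    weights = {y: p * math.exp(reward[y] / beta) for y, p in pi_ref.items()}
    Z = sum(weights.values())  # partition function
    return {y: w / Z for y, w in weights.items()}

pi_star = reward_weighted({"y1": 0.6, "y2": 0.4}, {"y1": 2.0, "y2": 1.0})
print(pi_star)  # sums to 1; the higher-reward output y1 gains probability mass
```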
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Reward-Weighted Probability Distribution
Consider a scenario where for a given input \(\mathbf{x}\), there are only two possible outputs, \(\mathbf{y}_1\) and \(\mathbf{y}_2\). A reference model assigns probabilities \(\pi_{\text{ref}}(\mathbf{y}_1|\mathbf{x}) = 0.6\) and \(\pi_{\text{ref}}(\mathbf{y}_2|\mathbf{x}) = 0.4\). A reward function gives scores \(r(\mathbf{x}, \mathbf{y}_1) = 2\) and \(r(\mathbf{x}, \mathbf{y}_2) = 1\). Assuming the scaling factor \(\beta\) is 1, what is the value of the normalization factor \(Z(\mathbf{x})\), which is calculated as \(Z(\mathbf{x}) = \sum_{\mathbf{y}} \pi_{\text{ref}}(\mathbf{y}|\mathbf{x}) \exp(r(\mathbf{x}, \mathbf{y}))\)?
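The arithmetic for this question can be sketched directly:

```python
import math

# Z(x) = pi_ref(y1|x) * exp(r1) + pi_ref(y2|x) * exp(r2), with beta = 1
Z = 0.6 * math.exp(2.0) + 0.4 * math.exp(1.0)
print(round(Z, 3))  # 5.521
```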
Consider the calculation of a normalization factor using the formula \(Z(\mathbf{x}) = \sum_{\mathbf{y}} \pi_{\text{ref}}(\mathbf{y}|\mathbf{x}) \exp(r(\mathbf{x}, \mathbf{y}))\). If the reward function \(r(\mathbf{x}, \mathbf{y})\) consistently returns a value of 0 for all possible outputs \(\mathbf{y}\), the normalization factor \(Z(\mathbf{x})\) will always be equal to 1, since \(\exp(0) = 1\) and the reference probabilities sum to one.
Impact of Scaling Factor on Normalization
Learn After
In the formula for a reward-weighted probability distribution, the parameter \(\beta\) acts as a temperature or inverse scaling factor. How does decreasing the value of \(\beta\) (i.e., moving it closer to 0, but remaining positive) affect the final distribution \(\pi^*\)?
Applying a Reward Function to a Language Model's Output
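The effect of shrinking \(\beta\) in the reward-weighted formula can be checked numerically. This sketch assumes \(\beta\) enters as \(\exp(r/\beta)\), consistent with the temperature interpretation; the probabilities and rewards are illustrative:

```python
import math

def reward_weighted(pi_ref, reward, beta):
    """Reward-weighted distribution, assuming beta enters as exp(r / beta)."""
    w = {y: p * math.exp(reward[y] / beta) for y, p in pi_ref.items()}
    Z = sum(w.values())
    return {y: v / Z for y, v in w.items()}

pi_ref = {"y1": 0.6, "y2": 0.4}
reward = {"y1": 2.0, "y2": 1.0}

# Large beta -> pi* stays close to pi_ref; beta -> 0+ concentrates nearly
# all probability mass on the highest-reward output.
for beta in (10.0, 1.0, 0.1):
    print(beta, reward_weighted(pi_ref, reward, beta))
```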
Target Policy as a Reward-Weighted Distribution
In the context of a reward-weighted probability distribution, defined as \(\pi^*(\mathbf{y}|\mathbf{x}) = \frac{1}{Z(\mathbf{x})}\, \pi_{\text{ref}}(\mathbf{y}|\mathbf{x})\, \exp(r(\mathbf{x}, \mathbf{y})/\beta)\), consider a scenario where a specific output, \(\mathbf{y}'\), receives a very high reward, \(r(\mathbf{x}, \mathbf{y}')\). However, the reference distribution assigns a probability to this output that is extremely close to zero, i.e., \(\pi_{\text{ref}}(\mathbf{y}'|\mathbf{x}) \approx 0\). What will be the approximate probability of \(\mathbf{y}'\) in the final distribution, \(\pi^*(\mathbf{y}'|\mathbf{x})\)?
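A quick numerical check of this limiting case (the specific probability and reward values are illustrative, with \(\beta = 1\)):

```python
import math

# A large reward cannot rescue an output the reference model
# considers essentially impossible: the product stays near zero.
pi_ref = {"y_prime": 1e-12, "other": 1.0 - 1e-12}
reward = {"y_prime": 10.0, "other": 0.0}
w = {y: p * math.exp(reward[y]) for y, p in pi_ref.items()}  # beta = 1
Z = sum(w.values())
p_y_prime = w["y_prime"] / Z
print(p_y_prime)  # roughly 2.2e-8 -- still approximately zero
```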