Essay

Critique of a Modified Policy Formulation

In reinforcement learning from human feedback, a target policy π* is often defined by re-weighting a reference policy π_ref according to a reward r(x, y):

$$\pi^{*}(\mathbf{y}|\mathbf{x}) = \frac{\pi_{\text{ref}}(\mathbf{y}|\mathbf{x}) \exp\left(\frac{1}{\beta} r(\mathbf{x}, \mathbf{y})\right)}{Z(\mathbf{x})}$$

A researcher proposes a simplification that removes the reference-policy term entirely, creating a new target:

$$\pi_{\text{new}}^{*}(\mathbf{y}|\mathbf{x}) = \frac{\exp\left(\frac{1}{\beta} r(\mathbf{x}, \mathbf{y})\right)}{Z'(\mathbf{x})}$$

Evaluate this proposed simplification. Discuss one potential advantage and two significant disadvantages of using this new formulation to guide a language model's learning process.
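To see what the two formulations mean concretely, here is a minimal numerical sketch over a toy set of three candidate responses for a fixed prompt. The values of π_ref, r, and β are hypothetical, chosen only for illustration; the point is that the original target is anchored to the reference policy, while the simplified target is a plain softmax over rewards that ignores π_ref entirely.

```python
import math

def normalize(weights):
    """Divide by the partition function Z so the values form a distribution."""
    z = sum(weights)
    return [w / z for w in weights]

# Toy setting: three candidate responses y for one prompt x.
# All numbers below are hypothetical, for illustration only.
pi_ref = [0.7, 0.2, 0.1]   # reference policy pi_ref(y|x)
r = [1.0, 2.0, 0.5]        # reward r(x, y)
beta = 1.0                 # temperature / KL-strength parameter

# Original target: pi*(y|x) ∝ pi_ref(y|x) * exp(r(x,y)/beta)
pi_star = normalize([p * math.exp(ri / beta) for p, ri in zip(pi_ref, r)])

# Proposed simplification: pi_new(y|x) ∝ exp(r(x,y)/beta),
# a softmax over rewards with no dependence on pi_ref.
pi_new = normalize([math.exp(ri / beta) for ri in r])

print(pi_star)  # probability mass stays anchored to pi_ref
print(pi_new)   # probability mass determined by the reward alone
```

In this toy example the two targets even disagree about which response is most probable: π* keeps most mass on the response favored by π_ref, while π*_new shifts it to the highest-reward response, which previews the disadvantages the question asks about (loss of KL anchoring and vulnerability to reward hacking).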


Updated 2025-10-07

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Evaluation in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science