Short Answer

Analysis of Reward Function under Policy Convergence

In a language model alignment framework, the reward for generating a response $\mathbf{y}$ to a prompt $\mathbf{x}$ is given by the equation:

$$r(\mathbf{x}, \mathbf{y}) = \beta \left( \log \frac{\pi_{\theta}(\mathbf{y}|\mathbf{x})}{\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x})} + \log Z(\mathbf{x}) \right)$$

where $\pi_{\theta}$ is the target policy, $\pi_{\theta_{\text{ref}}}$ is the reference policy, $\beta$ is a positive constant, and $Z(\mathbf{x})$ is a normalization factor that depends only on the prompt $\mathbf{x}$. Suppose that for a given prompt $\mathbf{x}$, the target policy becomes identical to the reference policy for all possible responses (i.e., $\pi_{\theta}(\mathbf{y}|\mathbf{x}) = \pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x})$ for every $\mathbf{y}$). What does this imply about the reward $r(\mathbf{x}, \mathbf{y})$ for any response $\mathbf{y}$? Explain your reasoning by analyzing the components of the equation.
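A sketch of the reasoning, using only the definitions above: if $\pi_{\theta}(\mathbf{y}|\mathbf{x}) = \pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x})$ for every $\mathbf{y}$, the ratio inside the logarithm equals 1, so the log-ratio term vanishes:

$$r(\mathbf{x}, \mathbf{y}) = \beta \left( \log \frac{\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x})}{\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x})} + \log Z(\mathbf{x}) \right) = \beta \left( \log 1 + \log Z(\mathbf{x}) \right) = \beta \log Z(\mathbf{x}).$$

The reward therefore reduces to the constant $\beta \log Z(\mathbf{x})$, identical for every response $\mathbf{y}$: all response-dependent signal in this reward comes from the policy log-ratio.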


Updated 2025-10-08


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science
