Learn Before
Analyzing the Regularization Parameter in Policy Optimization
A language model's policy, π_θ, is being optimized by minimizing the objective function below. In this function, reward(x, y) represents the reward for generating output y from input x, and π_ref is a fixed reference policy.

min_θ E[-reward(x, y) + β * (log π_θ(y|x) - log π_ref(y|x))]
Analyze the trade-offs involved when setting the hyperparameter β to a very high value versus a very low (but non-zero) value. Describe the likely characteristics of the resulting model's behavior in each scenario.
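The objective in the question trades reward against a KL-style penalty that keeps the policy close to the reference. A minimal numeric sketch, using entirely hypothetical reward and log-probability values, shows how the choice of β flips which candidate generation the objective prefers:

```python
def objective(reward, logp_policy, logp_ref, beta):
    """Per-sample loss: -reward + beta * (log-ratio to the reference policy)."""
    return -reward + beta * (logp_policy - logp_ref)

# Two hypothetical candidate generations:
# A: high reward but far from the reference policy (large log-ratio)
# B: modest reward but close to the reference policy (log-ratio near zero)
cand_a = dict(reward=5.0, logp_policy=-2.0, logp_ref=-10.0)  # log-ratio = +8.0
cand_b = dict(reward=2.0, logp_policy=-3.0, logp_ref=-3.2)   # log-ratio = +0.2

for beta in (0.01, 10.0):
    loss_a = objective(beta=beta, **cand_a)
    loss_b = objective(beta=beta, **cand_b)
    preferred = "A (reward-seeking)" if loss_a < loss_b else "B (stays near reference)"
    print(f"beta={beta:>5}: loss_A={loss_a:.2f}, loss_B={loss_b:.2f} -> prefers {preferred}")
```

With a tiny β the reward term dominates and the off-distribution candidate A wins; with a large β the regularizer dominates and the objective favors candidate B, which barely deviates from the reference policy.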
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analyzing the Regularization Parameter in Policy Optimization
In the context of the training objective for a language model policy, consider the following formula where the goal is to find the optimal parameters by minimizing the expected value:
min E[-reward + β * (log_prob_policy - log_prob_reference)]

If the hyperparameter β is set to an extremely large positive value, what is the most likely outcome for the optimized policy?

A language model is being trained to minimize the following objective function:

Objective = E[-reward(x, y) + β * (log π_θ(y|x) - log π_θ_ref(y|x))]

During one training step, the current policy π_θ generates a response y that is highly creative and receives a very high reward(x, y). However, this response is stylistically very different from the typical outputs of the reference policy π_θ_ref, resulting in a very low probability π_θ_ref(y|x). Assuming β is a positive constant, how does this specific generation (x, y) influence the two main components of the objective function for this step?

Rearrangement of the Assumed DPO Objective
Unnormalized Target Distribution in the DPO Objective
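One of the related questions above asks how a high-reward but off-distribution generation influences the two components of the objective. A short numeric sketch, with hypothetical probabilities and a hypothetical β, makes the opposing pulls concrete:

```python
import math

# Hypothetical numbers for the creative-but-off-distribution sample (x, y)
reward = 9.0      # very high reward(x, y)
p_policy = 0.05   # current policy π_θ assigns moderate probability to y
p_ref = 1e-6      # reference policy π_θ_ref finds y very unlikely
beta = 0.1

# Reward term: large negative contribution to the loss (encourages y)
reward_term = -reward

# Regularizer term: beta * (log π_θ(y|x) - log π_θ_ref(y|x)).
# Because p_ref is tiny, the log-ratio is large and positive (penalizes y).
reg_term = beta * (math.log(p_policy) - math.log(p_ref))

print(f"reward term      = {reward_term:.3f}")
print(f"regularizer term = {reg_term:.3f}")
```

The two terms pull in opposite directions: the reward term lowers the loss for the creative sample, while the KL-style regularizer raises it in proportion to the divergence from the reference policy, with β setting the balance.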