Essay

Analyzing the Regularization Parameter in Policy Optimization

A language model's policy, $\pi_\theta$, is being optimized by minimizing the objective function below. In this function, $r(\mathbf{x}, \mathbf{y})$ represents a reward for generating output $\mathbf{y}$ from input $\mathbf{x}$, and $\pi_{\theta_{\text{ref}}}$ is a fixed reference policy.

$$\min_{\theta} \; \mathbb{E}_{\mathbf{x} \sim \mathcal{D},\, \mathbf{y} \sim \pi_{\theta}} \left[ -r(\mathbf{x}, \mathbf{y}) + \beta \left( \log \pi_{\theta}(\mathbf{y}|\mathbf{x}) - \log \pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}) \right) \right]$$
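Note that, in expectation over $\mathbf{y} \sim \pi_\theta$, the penalty term equals $\beta$ times the KL divergence from the reference policy:

$$\mathbb{E}_{\mathbf{y} \sim \pi_{\theta}}\left[\log \pi_{\theta}(\mathbf{y}|\mathbf{x}) - \log \pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x})\right] = \mathrm{KL}\left(\pi_{\theta}(\cdot|\mathbf{x}) \,\|\, \pi_{\theta_{\text{ref}}}(\cdot|\mathbf{x})\right),$$

so minimizing the objective amounts to maximizing expected reward while keeping $\pi_\theta$ close to $\pi_{\theta_{\text{ref}}}$, with $\beta$ controlling the strength of that constraint.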

Analyze the trade-offs involved when setting the hyperparameter $\beta$ to a very high value versus a very low (but non-zero) value. Describe the likely characteristics of the resulting model's behavior in each scenario.
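For concreteness, here is a minimal PyTorch sketch of the per-sample loss implied by the objective; the function name, tensor shapes, and dummy values are hypothetical, and the sequence log-probabilities are assumed to be already summed over output tokens.

```python
import torch

def kl_regularized_loss(reward, logp_theta, logp_ref, beta):
    """Monte Carlo estimate of the objective over a batch of sampled outputs.

    reward      -- r(x, y) for each sampled output y, shape [batch]
    logp_theta  -- log pi_theta(y|x) summed over output tokens, shape [batch]
    logp_ref    -- log pi_ref(y|x) under the frozen reference policy, shape [batch]
    beta        -- regularization strength
    """
    # -r(x, y) + beta * (log pi_theta(y|x) - log pi_ref(y|x)), averaged over the batch
    return (-reward + beta * (logp_theta - logp_ref)).mean()

# Hypothetical usage with dummy values, sweeping beta to see the trade-off:
# large beta makes the log-ratio term dominate; small beta makes reward dominate.
reward = torch.tensor([1.2, 0.4])
logp_theta = torch.tensor([-35.0, -42.5])
logp_ref = torch.tensor([-36.1, -41.8])

for beta in (0.01, 0.1, 1.0, 10.0):
    print(beta, float(kl_regularized_loss(reward, logp_theta, logp_ref, beta)))
```

Note that this sketch only evaluates the objective for already-sampled outputs; an actual policy-gradient update would also propagate the reward's influence through a score-function (REINFORCE) term.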



Tags: Ch.4 Alignment - Foundations of Large Language Models; Analysis in Bloom's Taxonomy; Computing Sciences