Formula

Simplified Policy Optimization Objective as KL Divergence Minimization

The policy optimization objective can be mathematically simplified to finding the parameters $\tilde{\theta}$ that minimize the expected Kullback-Leibler (KL) divergence between the learned target policy $\pi_{\theta}$ and the optimal target distribution $\pi^{*}$. This simplification is mathematically sound because the normalization term, $\log Z(\mathbf{x})$, is independent of the optimization variable $\theta$ and can therefore be removed from the $\argmin_{\theta}$ operation without altering the optimal parameters. The simplified training objective is expressed as:

$$\tilde{\theta} = \argmin_{\theta} \, \mathbb{E}_{\mathbf{x} \sim \mathcal{D}} \Big[ \mathrm{KL}\big(\pi_{\theta}(\cdot \mid \mathbf{x}) \,\|\, \pi^{*}(\cdot \mid \mathbf{x})\big) \Big]$$
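To make the simplification explicit, write the optimal distribution as $\pi^{*}(\mathbf{y} \mid \mathbf{x}) = \tilde{\pi}(\mathbf{y} \mid \mathbf{x}) / Z(\mathbf{x})$, where $\tilde{\pi}$ denotes the unnormalized target ($\tilde{\pi}$ and the response variable $\mathbf{y}$ are notation introduced here for illustration). Expanding the KL term and using $\sum_{\mathbf{y}} \pi_{\theta}(\mathbf{y} \mid \mathbf{x}) = 1$ gives

$$\mathrm{KL}\big(\pi_{\theta}(\cdot \mid \mathbf{x}) \,\|\, \pi^{*}(\cdot \mid \mathbf{x})\big) = \sum_{\mathbf{y}} \pi_{\theta}(\mathbf{y} \mid \mathbf{x}) \big[ \log \pi_{\theta}(\mathbf{y} \mid \mathbf{x}) - \log \tilde{\pi}(\mathbf{y} \mid \mathbf{x}) \big] + \log Z(\mathbf{x}),$$

so $\log Z(\mathbf{x})$ only shifts the objective by a $\theta$-independent constant and cannot move its minimizer.

The same fact can be checked numerically. The following is a minimal sketch, not from the source: the four-way categorical policy, the unnormalized log-probabilities, and the use of PyTorch are all assumptions made for illustration. It verifies that the full and simplified objectives have identical gradients with respect to $\theta$.

```python
import torch

# Toy check (hypothetical setup): a categorical policy over 4 responses for a
# single prompt x. Dropping the constant log Z(x) from the KL objective must
# leave the gradient with respect to theta unchanged.
torch.manual_seed(0)
theta = torch.randn(4, requires_grad=True)          # policy logits
log_pi_tilde = torch.tensor([1.0, 0.5, -0.2, 0.3])  # unnormalized log pi*(y|x) (assumed values)
log_Z = torch.logsumexp(log_pi_tilde, dim=0)        # log Z(x): constant in theta

log_pi_theta = torch.log_softmax(theta, dim=0)
pi_theta = log_pi_theta.exp()

# Full objective: KL(pi_theta || pi*) with the normalized target.
kl_full = (pi_theta * (log_pi_theta - (log_pi_tilde - log_Z))).sum()
# Simplified objective: the same expression with log Z(x) dropped.
kl_simple = (pi_theta * (log_pi_theta - log_pi_tilde)).sum()

grad_full = torch.autograd.grad(kl_full, theta, retain_graph=True)[0]
grad_simple = torch.autograd.grad(kl_simple, theta)[0]
print(torch.allclose(grad_full, grad_simple, atol=1e-6))  # True: same minimizer
```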
