Formula

Simplified Policy Optimization Objective as KL Divergence Minimization

The policy optimization objective can be mathematically simplified to finding the parameters $\tilde{\theta}$ that minimize the expected Kullback-Leibler (KL) divergence between the learned target policy $\pi_{\theta}$ and the optimal target distribution $\pi^{*}$. This simplification is mathematically sound because the normalization term, $\log Z(\mathbf{x})$, is independent of the optimization variable $\theta$ and can therefore be removed from the $\argmin_{\theta}$ operation without altering the optimal parameters. The simplified training objective is expressed as:

$$\tilde{\theta} = \argmin_{\theta} \, \mathbb{E}_{\mathbf{x} \sim \mathcal{D}} \Big[ \mathrm{KL}\big(\pi_{\theta}(\cdot \mid \mathbf{x}) \,\|\, \pi^{*}(\cdot \mid \mathbf{x})\big) \Big]$$
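To make the simplification explicit, write the optimal distribution as $\pi^{*}(\mathbf{y} \mid \mathbf{x}) = \tilde{\pi}(\mathbf{y} \mid \mathbf{x}) / Z(\mathbf{x})$, where $\tilde{\pi}$ denotes the unnormalized target ($\tilde{\pi}$ and the response variable $\mathbf{y}$ are notation introduced here for illustration). Expanding the KL term and using $\sum_{\mathbf{y}} \pi_{\theta}(\mathbf{y} \mid \mathbf{x}) = 1$ gives

$$\mathrm{KL}\big(\pi_{\theta}(\cdot \mid \mathbf{x}) \,\|\, \pi^{*}(\cdot \mid \mathbf{x})\big) = \sum_{\mathbf{y}} \pi_{\theta}(\mathbf{y} \mid \mathbf{x}) \big[ \log \pi_{\theta}(\mathbf{y} \mid \mathbf{x}) - \log \tilde{\pi}(\mathbf{y} \mid \mathbf{x}) \big] + \log Z(\mathbf{x}),$$

so $\log Z(\mathbf{x})$ only shifts the objective by a $\theta$-independent constant and cannot move its minimizer.

The same fact can be checked numerically. The following is a minimal sketch, not from the source: the four-way categorical policy, the unnormalized log-probabilities, and the use of PyTorch are all assumptions made for illustration. It verifies that the full and simplified objectives have identical gradients with respect to $\theta$.

```python
import torch

# Toy check (hypothetical setup): a categorical policy over 4 responses for a
# single prompt x. Dropping the constant log Z(x) from the KL objective must
# leave the gradient with respect to theta unchanged.
torch.manual_seed(0)
theta = torch.randn(4, requires_grad=True)          # policy logits
log_pi_tilde = torch.tensor([1.0, 0.5, -0.2, 0.3])  # unnormalized log pi*(y|x) (assumed values)
log_Z = torch.logsumexp(log_pi_tilde, dim=0)        # log Z(x): constant in theta

log_pi_theta = torch.log_softmax(theta, dim=0)
pi_theta = log_pi_theta.exp()

# Full objective: KL(pi_theta || pi*) with the normalized target.
kl_full = (pi_theta * (log_pi_theta - (log_pi_tilde - log_Z))).sum()
# Simplified objective: the same expression with log Z(x) dropped.
kl_simple = (pi_theta * (log_pi_theta - log_pi_tilde)).sum()

grad_full = torch.autograd.grad(kl_full, theta, retain_graph=True)[0]
grad_simple = torch.autograd.grad(kl_simple, theta)[0]
print(torch.allclose(grad_full, grad_simple, atol=1e-6))  # True: same minimizer
```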
